A high-performance Python DataFrame library for lazy out-of-core processing and visualization of billion-row datasets at interactive speeds.
Vaex is a high-performance Python library for lazy, out-of-core DataFrames that enables visualization and exploration of massive tabular datasets. It calculates statistics on N-dimensional grids at over a billion rows per second, using memory mapping and a zero-copy policy to avoid wasting memory. The library supports interactive exploration through histograms, density plots, and 3D volume rendering.
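To make "statistics on N-dimensional grids" concrete, here is a minimal pure-Python sketch of a 2-D count computed chunk by chunk, so the full dataset never has to sit in memory at once. The function name `count_on_grid` and its parameters are hypothetical and chosen for illustration; this is not Vaex's API or implementation, which is compiled and parallelized.

```python
from typing import Iterable, List, Sequence, Tuple


def count_on_grid(
    chunks: Iterable[Tuple[Sequence[float], Sequence[float]]],
    limits: Tuple[Tuple[float, float], Tuple[float, float]],
    shape: Tuple[int, int],
) -> List[List[int]]:
    """Count points per cell of a 2-D grid, consuming the data one chunk at a time."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    grid = [[0] * ny for _ in range(nx)]
    for xs, ys in chunks:
        for x, y in zip(xs, ys):
            # Ignore points outside the half-open ranges [xmin, xmax) and [ymin, ymax).
            if not (xmin <= x < xmax and ymin <= y < ymax):
                continue
            # Map each coordinate to a cell index; clamp to guard against rounding.
            i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
            j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
            grid[i][j] += 1
    return grid


# Two small chunks stand in for blocks streamed from disk (illustrative data).
chunks = [
    ([0.1, 0.6, 0.9], [0.2, 0.7, 0.1]),
    ([0.4, 0.55, 1.5], [0.9, 0.6, 0.5]),
]
grid = count_on_grid(chunks, limits=((0.0, 1.0), (0.0, 1.0)), shape=(2, 2))
print(grid)  # [[1, 1], [1, 2]]
```

The resulting grid is small regardless of row count, which is why a histogram or density plot can be drawn from it directly.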
Data scientists, researchers, and engineers working with large tabular datasets (billions of rows) who need efficient data processing and visualization on a single machine without moving to distributed clusters.
Vaex makes big-data tasks practical on a laptop by combining memory mapping, lazy computation, and efficient algorithms. It avoids the overhead of copying data and enables interactive exploration of datasets that would otherwise require cluster computing.
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Vaex memory-maps HDF5 and Apache Arrow files, so huge datasets open instantly without being loaded into RAM; the README images demonstrate fast access to multi-gigabyte files.
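The general idea of memory mapping can be sketched with Python's standard library alone. The example below writes a raw column of float64 values to a scratch file, then maps it and reads individual values through a typed view without copying the file into a Python list. The file name `column.f64` and the raw layout are assumptions for illustration; real HDF5 and Arrow files carry their own metadata and are not read this way.

```python
import mmap
import os
import tempfile
from array import array

N = 1_000_000  # number of float64 values in the hypothetical column

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.f64")  # hypothetical raw column file

    # Write the column once; this stands in for an existing file on disk.
    with open(path, "wb") as f:
        array("d", map(float, range(N))).tofile(f)

    with open(path, "rb") as f:
        # Mapping is cheap: the OS pages data in only when it is touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # cast("d") reinterprets the mapped bytes as doubles without copying.
            with memoryview(mm) as raw, raw.cast("d") as col:
                n_values = len(col)
                first, last = col[0], col[N - 1]
                tail_sum = sum(col[i] for i in range(N - 5, N))

print(n_values, first, last, tail_sum)
```

Only the pages that are actually accessed need to be read from disk, which is what makes opening a very large file feel instantaneous.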
Its lazy expression system and zero-copy policy avoid memory waste by transforming data on-demand and streaming only during computations, enabling out-of-core operations directly on disk.
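A lazy expression system of this kind can be illustrated with a small sketch: building an expression only records what to compute, and values are produced later, one chunk at a time. The names `Expr`, `col`, `constant`, and `streamed_sum` are hypothetical and are not part of Vaex; the sketch shows the pattern, not the library's internals.

```python
import operator
from typing import Callable, Dict, Iterable, List

Chunk = Dict[str, List[float]]  # column name -> values for one block of rows


class Expr:
    """A deferred computation: creating it does no work, evaluating a chunk does."""

    def __init__(self, fn: Callable[[Chunk], List[float]], text: str) -> None:
        self.fn = fn
        self.text = text

    def _binary(self, other, op, symbol: str) -> "Expr":
        rhs = other if isinstance(other, Expr) else constant(other)
        return Expr(
            lambda chunk: [op(a, b) for a, b in zip(self.fn(chunk), rhs.fn(chunk))],
            f"({self.text} {symbol} {rhs.text})",
        )

    def __add__(self, other) -> "Expr":
        return self._binary(other, operator.add, "+")

    def __mul__(self, other) -> "Expr":
        return self._binary(other, operator.mul, "*")


def col(name: str) -> Expr:
    """Refer to a column by name; nothing is read until evaluation."""
    return Expr(lambda chunk: chunk[name], name)


def constant(value: float) -> Expr:
    """Broadcast a scalar to the length of whatever chunk is being evaluated."""
    def fn(chunk: Chunk) -> List[float]:
        n_rows = len(next(iter(chunk.values())))
        return [value] * n_rows
    return Expr(fn, repr(value))


def streamed_sum(expr: Expr, chunks: Iterable[Chunk]) -> float:
    """Evaluate the expression chunk by chunk, keeping only a running total."""
    return sum(sum(expr.fn(chunk)) for chunk in chunks)


# Defining the virtual column allocates no per-row storage.
r2 = col("x") * col("x") + col("y") * col("y")

chunks = [
    {"x": [1.0, 2.0], "y": [2.0, 0.0]},
    {"x": [3.0], "y": [4.0]},
]
total = streamed_sum(r2, chunks)
print(r2.text, total)  # ((x * x) + (y * y)) 34.0
```

The intermediate values for each chunk are discarded as soon as they have been folded into the result, so memory use depends on chunk size rather than dataset size.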
Parallelized groupby operations exceed a billion rows per second, especially with categorical data, allowing rapid statistical calculations on massive tabular datasets.
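One reason categorical data suits fast grouping is that each category can be stored as a small integer code, so aggregation becomes indexing into fixed-size accumulators. The sketch below shows that structure together with the split-then-merge pattern a parallel groupby typically follows. The function names are hypothetical, and this is an assumption about the general technique rather than a description of Vaex's code. Pure-Python threads also give no real speedup here because of the interpreter lock; the example shows only the shape of the computation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

ChunkPair = Tuple[Sequence[int], Sequence[float]]  # (category codes, values)


def partial_sums(codes: Sequence[int], values: Sequence[float],
                 n_groups: int) -> Tuple[List[float], List[int]]:
    """Accumulate per-group sums and counts for one chunk."""
    sums = [0.0] * n_groups
    counts = [0] * n_groups
    for code, value in zip(codes, values):
        sums[code] += value   # the code indexes the accumulator directly
        counts[code] += 1
    return sums, counts


def grouped_mean(chunks: Sequence[ChunkPair], n_groups: int,
                 workers: int = 4) -> List[float]:
    """Aggregate each chunk independently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: partial_sums(c[0], c[1], n_groups), chunks))
    sums = [sum(p[0][g] for p in partials) for g in range(n_groups)]
    counts = [sum(p[1][g] for p in partials) for g in range(n_groups)]
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


labels = ["a", "b", "c"]  # code 0 -> "a", 1 -> "b", 2 -> "c"
chunks = [
    ([0, 1, 0], [1.0, 2.0, 3.0]),
    ([2, 1], [10.0, 4.0]),
]
means: Dict[str, float] = dict(zip(labels, grouped_mean(chunks, n_groups=len(labels))))
print(means)  # {'a': 2.0, 'b': 3.0, 'c': 10.0}
```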
Vaex joins large tables without materializing the right table, saving gigabytes of memory and achieving sub-second performance on billion-row datasets, as highlighted in the README.
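A join that does not materialize the right table can be pictured as a lookup: build an index from key to right-hand row number once, then resolve right-hand values only when a left row is accessed. The sketch below assumes left-join semantics and keeps the first match for duplicate keys. `build_index` and `LazyJoinedColumn` are hypothetical names for illustration; how Vaex implements its join is not described in the source.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def build_index(right_keys: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each right-table key to the row number of its first occurrence."""
    index: Dict[Hashable, int] = {}
    for row, key in enumerate(right_keys):
        index.setdefault(key, row)
    return index


class LazyJoinedColumn:
    """A right-table column viewed in the left table's row order, without copying."""

    def __init__(self, left_keys: Sequence[Hashable],
                 right_index: Dict[Hashable, int],
                 right_values: Sequence[object]) -> None:
        self.left_keys = left_keys
        self.right_index = right_index
        self.right_values = right_values

    def __len__(self) -> int:
        return len(self.left_keys)

    def __getitem__(self, i: int) -> Optional[object]:
        # Resolve the matching right row on access; unmatched keys yield None.
        row = self.right_index.get(self.left_keys[i])
        return None if row is None else self.right_values[row]


left_keys = [3, 1, 3, 7]
right_keys = [1, 2, 3]
right_names = ["one", "two", "three"]

joined = LazyJoinedColumn(left_keys, build_index(right_keys), right_names)
resolved: List[Optional[object]] = [joined[i] for i in range(len(joined))]
print(resolved)  # ['three', 'one', 'three', None]
```

The only extra storage is the key index; the right-hand column itself is never duplicated into a new left-aligned array.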
Optimal performance requires data in HDF5 or Apache Arrow formats; converting from CSV or other sources adds overhead, and support for real-time streaming beyond S3 is minimal.
Compared to Pandas, Vaex has a smaller ecosystem of compatible libraries and tools, which can limit integration with niche data science workflows or require custom adaptations.
Advanced features like Remote DataFrames have incomplete documentation (noted as 'coming soon' in the README), potentially hindering adoption for complex use cases.