# Introduction
Handling huge datasets containing billions of rows is a major challenge in data science and analytics. Traditional tools like pandas work well for small to medium datasets that fit in memory, but as the dataset size increases, they become slow, consume large amounts of random access memory (RAM), and often crash with out-of-memory (OOM) errors.
This is where Vaex comes in: a high-performance Python library for out-of-core data processing. Vaex lets you inspect, modify, visualize, and analyze large tabular datasets efficiently and with a small memory footprint, even on a standard laptop.
# What Is Vaex?
Vaex is a Python library for lazy, out-of-core DataFrames (similar to pandas), designed for data larger than your RAM.
Key features:
- Vaex is designed to handle large datasets efficiently by working directly with the data on disk and reading only the necessary parts, avoiding loading entire files into memory.
- Vaex uses lazy evaluation, meaning operations are computed only when results are actually requested, and it can open columnar file formats (which store data by columns rather than rows) such as HDF5, Apache Arrow, and Parquet almost instantly through memory mapping.
- Built on an optimized C/C++ backend, Vaex can compute statistics and perform operations at rates of up to a billion rows per second, accelerating large-scale analysis even on modest hardware.
- It has a pandas-like application programming interface (API) that eases the transition for users already familiar with pandas, helping them take advantage of big data capabilities without a steep learning curve.
# Vaex vs. Dask
Vaex is not the same as Dask overall, but it is similar to Dask DataFrames, which are built on top of pandas DataFrames. This means that Dask inherits some pandas issues, such as the requirement that data be completely loaded into RAM in order to be processed in some contexts. This is not the case with Vaex. Vaex does not make copies of a DataFrame, so it can process larger DataFrames on machines with less main memory. Vaex and Dask both use lazy processing. The primary difference is that Vaex calculates fields only when required, whereas with Dask we need to explicitly call the compute() function. To take full advantage of Vaex, data needs to be in HDF5 or Apache Arrow format.
# Why Do Traditional Tools Struggle?
Tools like pandas load the entire dataset into RAM before processing. For datasets larger than memory, this results in:
- Slow performance
- System crashes (OOM errors)
- Limited interactivity
Vaex never loads the entire dataset into memory. Instead, it:
- Streams data from disk
- Uses virtual columns and lazy evaluation to delay calculations
- Materializes results only when they are explicitly requested
This enables analysis of large datasets even on modest hardware.
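The streaming idea can be sketched in plain Python. This is a minimal illustration and not how Vaex is implemented: it writes a column of float64 values to a raw binary file, then computes the mean by reading fixed-size chunks so that only one chunk is in memory at a time. The file layout, the helper name `chunked_mean`, and the chunk size are assumptions made for this example.

```python
import os
import tempfile
from array import array


def chunked_mean(path, chunk_rows=100_000):
    """Mean of a raw float64 column file, reading chunk_rows values at a time."""
    total, count = 0.0, 0
    item_size = array("d").itemsize  # 8 bytes per float64
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_rows * item_size)
            if not raw:
                break
            chunk = array("d")
            chunk.frombytes(raw)  # only this chunk is held in memory
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical demo data: one million values 0.0, 1.0, ..., 999999.0
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    array("d", (float(i) for i in range(1_000_000))).tofile(f)

mean_value = chunked_mean(path)
print(mean_value)
```

Peak memory here depends on `chunk_rows`, not on the size of the file, which is the property that lets an out-of-core tool scale past RAM.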
# How Vaex Works Under the Hood
// Out-of-Core Execution
Vaex reads data from disk as needed using memory mapping. This allows it to operate on data files far larger than the available RAM.
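Memory mapping itself can be demonstrated with the Python standard library. The sketch below is only an analogy for what a memory-mapped columnar file gives you; the file name and the raw float64 layout are assumptions for illustration, not the HDF5 layout Vaex uses.

```python
import mmap
import os
import tempfile
from array import array

# Hypothetical column file: one million float64 values stored back to back.
path = os.path.join(tempfile.mkdtemp(), "fares.bin")
with open(path, "wb") as f:
    array("d", (i * 0.5 for i in range(1_000_000))).tofile(f)

with open(path, "rb") as f:
    # Mapping the file reserves address space; it does not read the data.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # A typed view over the mapping, created without copying any bytes.
        with memoryview(mapped) as raw, raw.cast("d") as column:
            n_rows = len(column)
            # Indexing pulls in only the pages that hold these values.
            first_three = column[:3].tolist()
            last_value = column[-1]

print(n_rows, first_three, last_value)
```

Opening the mapping costs roughly the same regardless of file size, because the operating system loads pages only when they are touched.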
// Lazy Evaluation
Instead of executing each operation immediately, Vaex builds a computation graph. The calculation is executed only when you request a result (for example, when printing or plotting).
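A toy version of a deferred computation graph makes the idea concrete. The `Lazy` class below is hypothetical and written only for this illustration; it is not part of Vaex and says nothing about Vaex internals beyond the general principle of deferring work.

```python
class Lazy:
    """A node in a tiny computation graph; nothing runs until .compute()."""

    evaluations = 0  # counts how many nodes have actually been executed

    def __init__(self, func, *parents):
        self.func = func
        self.parents = parents

    @classmethod
    def source(cls, values):
        return cls(lambda: list(values))

    def map(self, fn):
        return Lazy(lambda xs: [fn(x) for x in xs], self)

    def filter(self, pred):
        return Lazy(lambda xs: [x for x in xs if pred(x)], self)

    def compute(self):
        inputs = [parent.compute() for parent in self.parents]
        Lazy.evaluations += 1
        return self.func(*inputs)


sales = Lazy.source([200, 1500, 3200, 800])
pipeline = sales.filter(lambda s: s > 1000).map(lambda s: s * 2)

executed_before = Lazy.evaluations  # building the graph ran nothing
result = pipeline.compute()         # the whole graph runs now
executed_after = Lazy.evaluations

print(executed_before, result, executed_after)
```

Building the pipeline only records what should happen; the three nodes (source, filter, map) execute once a result is requested.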
// Virtual Columns
Virtual columns are expressions defined on a dataset. They store no values of their own and are evaluated on the fly when needed. This saves RAM and speeds up workflows.
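As a rough mental model, a virtual column is a stored recipe rather than stored data. The `TinyFrame` class below is a hypothetical stand-in written for this illustration, not Vaex code; it shows that adding an expression does not increase the number of stored values.

```python
class TinyFrame:
    """Real columns hold data; virtual columns hold only an expression."""

    def __init__(self, **columns):
        self.columns = columns  # name -> list of stored values
        self.virtual = {}       # name -> function of one row (a dict)

    def add_virtual(self, name, expression):
        self.virtual[name] = expression  # stores the recipe, not the values

    def stored_values(self):
        return sum(len(values) for values in self.columns.values())

    def evaluate(self, name):
        """Yield the virtual column's values one row at a time."""
        names = list(self.columns)
        for values in zip(*(self.columns[n] for n in names)):
            yield self.virtual[name](dict(zip(names, values)))


frame = TinyFrame(tip_amount=[1.0, 2.0, 3.0], fare_amount=[4.0, 8.0, 4.0])
before = frame.stored_values()
frame.add_virtual(
    "tip_percentage", lambda row: row["tip_amount"] / row["fare_amount"] * 100
)
after = frame.stored_values()  # unchanged: no new values were stored

tip_percentage = list(frame.evaluate("tip_percentage"))
print(before, after, tip_percentage)
```

The values only exist while they are being consumed, which is why a derived column on a very large dataset can be close to free in memory terms.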
# Getting Started with Vaex
// Setting Up Vaex
Create a clean virtual environment:
conda create -n vaex_demo python=3.9
conda activate vaex_demo
Install Vaex with pip:
pip install vaex-core vaex-hdf5 vaex-viz
Upgrade Vaex:
pip install --upgrade vaex
Install supporting libraries:
pip install pandas numpy matplotlib
// Opening Large Datasets
Vaex supports several popular storage formats for large datasets. It can work directly with HDF5, Apache Arrow, and Parquet files, all of which are optimized for efficient disk access and fast analytics. Vaex can also read CSV files, but they should first be converted to a more efficient format to get good performance on larger datasets.
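One way to do the CSV conversion is with `vaex.from_csv`, whose `convert` and `chunk_size` parameters write an HDF5 copy while parsing the CSV in pieces. The sketch below assumes a Vaex version that supports these parameters; the helper name `csv_to_hdf5`, the file name, and the sample rows are hypothetical. The final call is left commented out so the snippet can be read and run without Vaex installed.

```python
import csv
import os
import tempfile


def csv_to_hdf5(csv_path, chunk_size=5_000_000):
    """Convert a CSV file to HDF5 once, then return the converted DataFrame."""
    import vaex  # imported here so this module loads even without Vaex

    # convert=True writes an HDF5 file and opens that instead of the CSV;
    # chunk_size bounds how many rows are parsed in memory at a time.
    return vaex.from_csv(csv_path, convert=True, chunk_size=chunk_size)


# Hypothetical sample file so the sketch has something to convert.
csv_path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "sales"])
    writer.writerows([["a", 1200], ["b", 800], ["a", 3100]])

# df = csv_to_hdf5(csv_path)  # uncomment where Vaex is installed
# print(df)
```

The conversion cost is paid once; later sessions can open the converted file directly.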
How to open a Parquet file:
import vaex
df = vaex.open("your_huge_dataset.parquet")
print(df)
You can now inspect the dataset structure without loading it into memory.
// Core Operations in Vaex
Filtering data:
filtered = df[df.sales > 1000]
This does not compute the result immediately. Instead, the filter is registered and applied only when needed.
Group-by and Aggregation:
result = df.groupby("category", agg=vaex.agg.mean("sales"))
print(result)
Vaex computes aggregations efficiently using parallel algorithms and minimal memory.
Computing statistics:
mean_price = df["price"].mean()
print(mean_price)
Vaex scans the dataset in chunks to compute the statistic, so the full column never has to be held in memory at once.
// Performance on a Taxi Dataset
We will create a synthetic 50-million-row taxi dataset to demonstrate the capabilities of Vaex:
import vaex
import numpy as np
import pandas as pd
import time
Set random seed for reproducibility:
np.random.seed(42)
print("Creating 50 million row dataset...")
n = 50_000_000
Generate realistic taxi trip data:
data = {
'passenger_count': np.random.randint(1, 7, n),
'trip_distance': np.random.exponential(3, n),
'fare_amount': np.random.gamma(10, 1.5, n),
'tip_amount': np.random.gamma(2, 1, n),
'total_amount': np.random.gamma(12, 1.8, n),
'payment_type': np.random.choice(['credit', 'cash', 'mobile'], n),
'pickup_hour': np.random.randint(0, 24, n),
'pickup_day': np.random.randint(1, 8, n),
}
Create a Vaex DataFrame:
df_vaex = vaex.from_dict(data)
Export to HDF5 format (efficient for Vaex):
df_vaex.export_hdf5('taxi_50M.hdf5')
print(f"Shape: {df_vaex.shape}")
print(f"Created dataset with {n:,} rows")
Output:
Shape: (50000000, 8)
Created dataset with 50,000,000 rows
Now we have a 50 million row dataset with 8 columns.
// Vaex vs. pandas Performance
Opening a large file with Vaex (memory-mapped):
start = time.time()
df_vaex = vaex.open('taxi_50M.hdf5')
vaex_time = time.time() - start
print(f"Vaex opened {df_vaex.shape[0]:,} rows in {vaex_time:.4f} seconds")
print(f"Memory usage: ~0 MB (memory-mapped)")
Output:
Vaex opened 50,000,000 rows in 0.0199 seconds
Memory usage: ~0 MB (memory-mapped)
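pandas, by contrast, loads everything into memory (don’t try this with 50 million rows!):
# This would fail on most machines
df_pandas = pd.read_hdf('taxi_50M.hdf5')
On a machine without enough free RAM, this results in a memory error. (pandas also expects its own PyTables-based HDF5 layout, so it may reject a Vaex-exported file for format reasons as well.) Vaex opens files almost instantly, regardless of size, because it does not load the data into memory.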
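A back-of-the-envelope estimate shows why loading this dataset eagerly is expensive. The calculation below assumes every numeric column is stored as a 64-bit value (the NumPy default for these generators on most platforms) and ignores the string column, which would add more on top.

```python
n_rows = 50_000_000
bytes_per_value = 8  # assumes int64/float64 storage for every numeric column

numeric_columns = [
    "passenger_count", "trip_distance", "fare_amount", "tip_amount",
    "total_amount", "pickup_hour", "pickup_day",
]

numeric_bytes = n_rows * bytes_per_value * len(numeric_columns)
print(f"Numeric columns alone: {numeric_bytes / 1e9:.1f} GB")
```

That is 2.8 GB before counting `payment_type` or any temporary copies made during loading, which is already a significant share of the RAM on a typical laptop.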
Basic Aggregation: Calculate statistics on 50 million rows:
start = time.time()
stats = {
'mean_fare': df_vaex.fare_amount.mean(),
'mean_distance': df_vaex.trip_distance.mean(),
'total_revenue': df_vaex.total_amount.sum(),
'max_fare': df_vaex.fare_amount.max(),
'min_fare': df_vaex.fare_amount.min(),
}
agg_time = time.time() - start
print(f"\nComputed 5 aggregations in {agg_time:.4f} seconds:")
print(f" Mean fare: ${stats['mean_fare']:.2f}")
print(f" Mean distance: {stats['mean_distance']:.2f} miles")
print(f" Total revenue: ${stats['total_revenue']:,.2f}")
print(f" Fare range: ${stats['min_fare']:.2f} - ${stats['max_fare']:.2f}")
Output:
Computed 5 aggregations in 0.8771 seconds:
Mean fare: $15.00
Mean distance: 3.00 miles
Total revenue: $1,080,035,827.27
Fare range: $1.25 - $55.30
Filtering: select long trips:
start = time.time()
long_trips = df_vaex[df_vaex.trip_distance > 10]
filter_time = time.time() - start
print(f"\nFiltered for trips > 10 miles in {filter_time:.4f} seconds")
print(f" Found: {len(long_trips):,} long trips")
print(f" Percentage: {(len(long_trips)/len(df_vaex)*100):.2f}%")
Output:
Filtered for trips > 10 miles in 0.0486 seconds
Found: 1,784,122 long trips
Percentage: 3.57%
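The reported percentage can be sanity-checked analytically, because `trip_distance` was drawn from an exponential distribution with scale 3. For that distribution the probability of exceeding a threshold t is exp(-t / scale). The short check below uses only that formula and the parameters from the data generation step.

```python
import math

scale = 3.0        # mean of the exponential used for trip_distance
threshold = 10.0   # "long trip" cutoff in miles
n_rows = 50_000_000

tail_probability = math.exp(-threshold / scale)  # P(X > t) = exp(-t / scale)
expected_long_trips = n_rows * tail_probability

print(f"Expected share: {tail_probability * 100:.2f}%")
print(f"Expected count: {expected_long_trips:,.0f}")
```

The expected share rounds to 3.57%, and the expected count lands within a few hundred rows of the 1,784,122 trips found, which is consistent with ordinary sampling noise.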
Multiple conditions:
start = time.time()
premium_trips = df_vaex[(df_vaex.trip_distance > 5) &
                        (df_vaex.fare_amount > 20) &
                        (df_vaex.payment_type == 'credit')]
multi_filter_time = time.time() - start
print(f"\nMultiple condition filter in {multi_filter_time:.4f} seconds")
print(f" Premium trips (>5mi, >$20, credit): {len(premium_trips):,}")
Output:
Multiple condition filter in 0.0582 seconds
Premium trips (>5mi, >$20, credit): 457,191
Group-by operation:
start = time.time()
by_payment = df_vaex.groupby('payment_type', agg={
'mean_fare': vaex.agg.mean('fare_amount'),
'mean_tip': vaex.agg.mean('tip_amount'),
'total_trips': vaex.agg.count(),
'total_revenue': vaex.agg.sum('total_amount')
})
groupby_time = time.time() - start
print(f"\nGroupBy operation in {groupby_time:.4f} seconds")
print(by_payment.to_pandas_df())
Output:
GroupBy operation in 5.6362 seconds
payment_type mean_fare mean_tip total_trips total_revenue
0 credit 15.001817 2.000065 16663623 3.599456e+08
1 mobile 15.001200 1.999679 16667691 3.600165e+08
2 cash 14.999397 2.000115 16668686 3.600737e+08
A more granular group-by:
start = time.time()
by_hour = df_vaex.groupby('pickup_hour', agg={
'avg_distance': vaex.agg.mean('trip_distance'),
'avg_fare': vaex.agg.mean('fare_amount'),
'trip_count': vaex.agg.count()
})
complex_groupby_time = time.time() - start
print(f"\nGroupBy by hour in {complex_groupby_time:.4f} seconds")
print(by_hour.to_pandas_df().head(10))
Output:
GroupBy by hour in 1.6910 seconds
pickup_hour avg_distance avg_fare trip_count
0 0 2.998120 14.997462 2083481
1 1 3.000969 14.998814 2084650
2 2 3.003834 15.001777 2081962
3 3 3.001263 14.998196 2081715
4 4 2.998343 14.999593 2083882
5 5 2.997586 15.003988 2083421
6 6 2.999887 15.011615 2083213
7 7 3.000240 14.996892 2085156
8 8 3.002640 15.000326 2082704
9 9 2.999857 14.997857 2082284
// Advanced Vaex Features
Virtual columns (calculated columns) allow adding columns without any data copying:
df_vaex['tip_percentage'] = (df_vaex.tip_amount / df_vaex.fare_amount) * 100
df_vaex['is_generous_tipper'] = df_vaex.tip_percentage > 20
df_vaex['rush_hour'] = (
    ((df_vaex.pickup_hour >= 7) & (df_vaex.pickup_hour <= 9)) |
    ((df_vaex.pickup_hour >= 17) & (df_vaex.pickup_hour <= 19))
)
These are defined instantly and add no memory overhead, because they are evaluated lazily:
print("Added 3 virtual columns with zero memory overhead")
generous_tippers = df_vaex[df_vaex.is_generous_tipper]
print(f"Generous tippers (>20% tip): {len(generous_tippers):,}")
rush_hour_trips = df_vaex[df_vaex.rush_hour]
print(f"Rush hour trips: {len(rush_hour_trips):,}")
Output:
VIRTUAL COLUMNS
Added 3 virtual columns with zero memory overhead
Generous tippers (>20% tip): 11,997,433
Rush hour trips: 12,498,848
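The rush-hour count can be checked by hand, and the check also confirms how the mask is parsed. In Python, `&` binds more tightly than `|`, so the expression reads as (morning window) OR (evening window). The sketch below uses plain Python booleans as a stand-in for Vaex expressions; the function names are hypothetical.

```python
def rush_hour(h):
    # Relies on precedence: & is evaluated before |.
    return (h >= 7) & (h <= 9) | (h >= 17) & (h <= 19)


def rush_hour_explicit(h):
    # Same logic with the grouping spelled out.
    return ((h >= 7) & (h <= 9)) | ((h >= 17) & (h <= 19))


hours = range(24)  # pickup_hour was drawn uniformly from 0..23
flagged = [h for h in hours if rush_hour(h)]
same_result = all(rush_hour(h) == rush_hour_explicit(h) for h in hours)

share = len(flagged) / 24
expected_rows = share * 50_000_000

print(flagged, same_result, share, expected_rows)
```

Six of the 24 hours are flagged, so about a quarter of the 50 million rows (12.5 million) should qualify, which is in line with the 12,498,848 reported.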
Correlation analysis:
corr = df_vaex.correlation(df_vaex.trip_distance, df_vaex.fare_amount)
print(f"Correlation (distance vs fare): {corr:.4f}")
Percentiles:
try:
    percentiles = df_vaex.percentile_approx('fare_amount', [25, 50, 75, 90, 95, 99])
except AttributeError:
    percentiles = (
        df_vaex.fare_amount.quantile(0.25),
        df_vaex.fare_amount.quantile(0.50),
        df_vaex.fare_amount.quantile(0.75),
        df_vaex.fare_amount.quantile(0.90),
        df_vaex.fare_amount.quantile(0.95),
        df_vaex.fare_amount.quantile(0.99),
    )
print(f"\nFare percentiles:")
print(f"25th: ${percentiles[0]:.2f}")
print(f"50th (median): ${percentiles[1]:.2f}")
print(f"75th: ${percentiles[2]:.2f}")
print(f"90th: ${percentiles[3]:.2f}")
print(f"95th: ${percentiles[4]:.2f}")
print(f"99th: ${percentiles[5]:.2f}")
Standard deviation:
std_fare = df_vaex.fare_amount.std()
print(f"\nStandard deviation of fares: ${std_fare:.2f}")
Additional useful statistics:
print(f"\nAdditional statistics:")
print(f"Mean: ${df_vaex.fare_amount.mean():.2f}")
print(f"Min: ${df_vaex.fare_amount.min():.2f}")
print(f"Max: ${df_vaex.fare_amount.max():.2f}")
Output:
Correlation (distance vs fare): -0.0001
Fare percentiles:
25th: $11.57
50th (median): $nan
75th: $nan
90th: $nan
95th: $nan
99th: $nan
Standard deviation of fares: $4.74
Additional statistics:
Mean: $15.00
Min: $1.25
Max: $55.30
The near-zero correlation is expected, since trip_distance and fare_amount were generated independently. Only the 25th percentile returned a valid value in this run; the remaining percentiles came back as NaN, so the approximate percentile results should be checked before they are relied on.
// Exporting Data
# Export filtered data
high_value_trips = df_vaex[df_vaex.total_amount > 50]
Exporting to different formats:
start = time.time()
high_value_trips.export_hdf5('high_value_trips.hdf5')
export_time = time.time() - start
print(f"Exported {len(high_value_trips):,} rows to HDF5 in {export_time:.4f}s")
You can also export to CSV, Parquet, etc:
high_value_trips.export_csv('high_value_trips.csv')
high_value_trips.export_parquet('high_value_trips.parquet')
Output:
Exported 13,054 rows to HDF5 in 5.4508s
// Performance Summary Dashboard
print("VAEX PERFORMANCE SUMMARY")
print(f"Dataset size: {n:,} rows")
print(f"File size on disk: ~2.4 GB")
print(f"RAM usage: ~0 MB (memory-mapped)")
print()
print(f"Open time: {vaex_time:.4f} seconds")
print(f"Single aggregation: {agg_time:.4f} seconds")
print(f"Simple filter: {filter_time:.4f} seconds")
print(f"Complex filter: {multi_filter_time:.4f} seconds")
print(f"GroupBy operation: {groupby_time:.4f} seconds")
print()
print(f"Throughput: ~{n/groupby_time:,.0f} rows/second")
Output:
VAEX PERFORMANCE SUMMARY
Dataset size: 50,000,000 rows
File size on disk: ~2.4 GB
RAM usage: ~0 MB (memory-mapped)
Open time: 0.0199 seconds
Single aggregation: 0.8771 seconds
Simple filter: 0.0486 seconds
Complex filter: 0.0582 seconds
GroupBy operation: 5.6362 seconds
Throughput: ~8,871,262 rows/second
# Closing Thoughts
Vaex is ideal when you’re working with large datasets that exceed 1 GB and don’t fit in RAM, exploring big data, doing feature engineering with millions of rows, or building data preprocessing pipelines.
Vaex is not the right tool for datasets smaller than about 100 MB; for these, pandas is simpler. If you are dealing with complex joins across multiple tables, a structured query language (SQL) database may be a better fit. If you need the full pandas API, note that Vaex has limited compatibility. For real-time streaming data, other tools are more suitable.
Vaex fills a gap in the Python data science ecosystem: the ability to work efficiently and interactively on billion-row datasets without loading everything into memory. Its out-of-core architecture, lazy execution model, and optimized algorithms make it a powerful tool for big data exploration, even on laptops. Whether you’re exploring large-scale logs, scientific surveys, or high-frequency time series, Vaex helps bridge the gap between ease of use and big data scalability.
Shittu Olumide is a software engineer and technical writer who is passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and the ability to simplify complex concepts. You can also find Shittu on Twitter.
