# Introduction
If you’ve been working with data in Python, you’ve almost certainly used pandas. It has been the library of choice for data manipulation for over a decade. But recently, Polars has been gaining serious traction. Polars promises to be faster, more memory-efficient, and more intuitive than pandas. But is it worth learning? And how different is it really?
In this article, we will compare pandas and Polars side by side. You’ll see performance benchmarks and learn the syntax differences. By the end, you will be able to make an informed decision for your next data project.
You can find the code on GitHub.
# Setup
Let’s install both libraries first:
```bash
pip install pandas polars
```
Note: This article uses pandas 2.2.2 and Polars 1.31.0.
For this comparison, we will also use a dataset that is large enough to show real performance differences. We will use Faker to generate the test data.
Now we are ready to start coding.
# Measuring speed by reading large CSV files
Let’s start with one of the most common operations: reading a CSV file. To see a real performance difference, we will create a dataset with 1 million rows.
First, let’s prepare our sample data:
```python
import pandas as pd
from faker import Faker
import random

# Generate a large CSV file for testing
fake = Faker()
Faker.seed(42)
random.seed(42)

data = {
    'user_id': range(1000000),
    'name': [fake.name() for _ in range(1000000)],
    'email': [fake.email() for _ in range(1000000)],
    'age': [random.randint(18, 80) for _ in range(1000000)],
    'salary': [random.randint(30000, 150000) for _ in range(1000000)],
    'department': [random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'])
                   for _ in range(1000000)]
}
df_temp = pd.DataFrame(data)
df_temp.to_csv('large_dataset.csv', index=False)
print("✓ Generated large_dataset.csv with 1M rows")
```
This code creates a CSV file with realistic data. Let’s now compare reading speeds:
```python
import pandas as pd
import polars as pl
import time

# pandas: Read CSV
start = time.time()
df_pandas = pd.read_csv('large_dataset.csv')
pandas_time = time.time() - start

# Polars: Read CSV
start = time.time()
df_polars = pl.read_csv('large_dataset.csv')
polars_time = time.time() - start

print(f"Pandas read time: {pandas_time:.2f} seconds")
print(f"Polars read time: {polars_time:.2f} seconds")
print(f"Polars is {pandas_time/polars_time:.1f}x faster")
```
Sample output when reading the CSV:

```
Pandas read time: 1.92 seconds
Polars read time: 0.23 seconds
Polars is 8.2x faster
```
Here’s what’s happening: we time how long each library takes to read the same CSV file and calculate the speedup factor. Pandas uses its traditional single-threaded CSV reader, while Polars automatically parallelizes reading across multiple CPU cores.
On most machines, you will find that Polars is 2-5x faster at reading CSVs. The difference becomes even more significant with larger files.
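The timing pattern used above (capture a start time, run the operation, subtract) repeats throughout this article. As a convenience, it can be wrapped in a small context manager. This is a sketch using only the standard library (the `timer` name is ours, not part of either library); it uses `time.perf_counter()`, which is better suited to benchmarking than `time.time()`:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Record a high-resolution start time, yield control to the
    # timed block, then store and print the elapsed seconds.
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result['seconds'] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.3f}s")

# Usage: wrap any operation you want to benchmark.
with timer("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

Any block wrapped in `with timer(...)` prints its elapsed time when it exits, so repeated benchmarks stay one line each.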
# Measuring memory usage during operations
Speed is not the only consideration. Let’s see how much memory each library uses. We will run a series of operations and measure the memory consumption. Run `pip install psutil` first if it isn’t already in your environment:
```python
import pandas as pd
import polars as pl
import psutil
import os
import gc  # garbage collector, for better memory-release attempts

def get_memory_usage():
    """Get current process memory usage in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

# --- Test with pandas ---
gc.collect()
initial_memory_pandas = get_memory_usage()
df_pandas = pd.read_csv('large_dataset.csv')
filtered_pandas = df_pandas[df_pandas['age'] > 30]
grouped_pandas = filtered_pandas.groupby('department')['salary'].mean()
pandas_memory = get_memory_usage() - initial_memory_pandas
print(f"Pandas memory delta: {pandas_memory:.1f} MB")
del df_pandas, filtered_pandas, grouped_pandas
gc.collect()

# --- Test with Polars (eager mode) ---
gc.collect()
initial_memory_polars = get_memory_usage()
df_polars = pl.read_csv('large_dataset.csv')
filtered_polars = df_polars.filter(pl.col('age') > 30)
grouped_polars = filtered_polars.group_by('department').agg(pl.col('salary').mean())
polars_memory = get_memory_usage() - initial_memory_polars
print(f"Polars memory delta: {polars_memory:.1f} MB")
del df_polars, filtered_polars, grouped_polars
gc.collect()

# --- Summary ---
if pandas_memory > 0 and polars_memory > 0:
    print(f"Memory savings (Polars vs Pandas): {(1 - polars_memory/pandas_memory) * 100:.1f}%")
elif pandas_memory == 0 and polars_memory > 0:
    print(f"Polars used {polars_memory:.1f} MB while Pandas used 0 MB.")
elif polars_memory == 0 and pandas_memory > 0:
    print(f"Polars used 0 MB while Pandas used {pandas_memory:.1f} MB.")
else:
    print("Cannot compute memory savings: zero or negative memory deltas in both frameworks.")
```
This code measures the memory footprint:
- We use the psutil library to track memory usage before and after the operations
- Both libraries read the same file and perform the same filtering and grouping
- We calculate the difference in memory consumption
Sample output:

```
Pandas memory delta: 44.4 MB
Polars memory delta: 1.3 MB
Memory savings (Polars vs Pandas): 97.1%
```
The above results show the memory-usage delta for both pandas and Polars when performing the filtering and aggregation operations on large_dataset.csv.
- Pandas memory delta: the memory used by pandas for the operations.
- Polars memory delta: the memory consumed by Polars for the same operations.
- Memory savings (Polars vs Pandas): the percentage by which Polars used less memory than pandas.
It is common for Polars to demonstrate better memory efficiency thanks to its columnar data layout and optimized execution engine. Typically, you will see a 30% to 70% improvement with Polars.
Note: Measuring memory sequentially within the same Python process via `psutil.Process(...).memory_info().rss` can be misleading. Python’s memory allocator does not always release memory back to the operating system immediately, so the ‘clean’ baseline for a later test may still be inflated by a previous operation. For the most accurate comparisons, each test should ideally run in a separate, isolated Python process.
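A standard-library alternative that sidesteps some of this noise is `tracemalloc`, which tracks allocations made through Python’s own allocator rather than process RSS. A minimal sketch (with a caveat: it only sees Python-level allocations, so it will undercount libraries like Polars that allocate memory in native code):

```python
import tracemalloc

def peak_alloc_mb(fn):
    """Run fn() and return its peak Python-level allocation in MB."""
    tracemalloc.start()
    try:
        fn()
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 1024 / 1024

# Example: building a million-integer list allocates tens of MB.
peak = peak_alloc_mb(lambda: list(range(1_000_000)))
print(f"Peak allocation: {peak:.1f} MB")
```

Because each call starts and stops tracing around a single function, the measurement is less sensitive to leftover allocations from earlier tests than a sequential RSS comparison.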
# Comparing syntax for basic operations
Now let’s see how the syntax differs between the two libraries. We’ll cover the most common operations you’ll use.
## Selecting Columns
Let’s select a subset of the columns. We’ll create a very small DataFrame for this (and subsequent examples).
```python
import pandas as pd
import polars as pl

# Create sample data
data = {
    'name': ['Anna', 'Betty', 'Cathy'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}

# Pandas approach
df_pandas = pd.DataFrame(data)
result_pandas = df_pandas[['name', 'salary']]

# Polars approach
df_polars = pl.DataFrame(data)
result_polars = df_polars.select(['name', 'salary'])

# Alternative: more expressive
result_polars_alt = df_polars.select([pl.col('name'), pl.col('salary')])

print("Pandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)
```
Here are the main differences:
- Pandas uses bracket notation: `df[['col1', 'col2']]`
- Polars uses the `.select()` method
- Polars also supports the more expressive `pl.col()` syntax, which becomes powerful for complex operations
Output:

```
Pandas result:
    name  salary
0   Anna   50000
1  Betty   60000
2  Cathy   70000

Polars result:
shape: (3, 2)
┌───────┬────────┐
│ name  ┆ salary │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Anna  ┆ 50000  │
│ Betty ┆ 60000  │
│ Cathy ┆ 70000  │
└───────┴────────┘
```
Both give similar output, but the Polars syntax is more explicit about what you are doing.
## Filtering Rows
Now let’s filter the rows:

```python
# pandas: Filter rows where age > 28
filtered_pandas = df_pandas[df_pandas['age'] > 28]

# Alternative pandas syntax with query
filtered_pandas_alt = df_pandas.query('age > 28')

# Polars: Filter rows where age > 28
filtered_polars = df_polars.filter(pl.col('age') > 28)

print("Pandas filtered:")
print(filtered_pandas)
print("\nPolars filtered:")
print(filtered_polars)
```
Note the differences:
- In pandas, we use Boolean indexing with bracket notation; you can also use the `.query()` method
- Polars uses `.filter()` with `pl.col()` expressions
- The Polars syntax reads more like SQL: "filter where column age is greater than 28"
Output:

```
Pandas filtered:
    name  age  salary
1  Betty   30   60000
2  Cathy   35   70000

Polars filtered:
shape: (2, 3)
┌───────┬─────┬────────┐
│ name  ┆ age ┆ salary │
│ ---   ┆ --- ┆ ---    │
│ str   ┆ i64 ┆ i64    │
╞═══════╪═════╪════════╡
│ Betty ┆ 30  ┆ 60000  │
│ Cathy ┆ 35  ┆ 70000  │
└───────┴─────┴────────┘
```
## Adding New Columns
Now let’s add new columns to the DataFrame:
```python
# pandas: Add new columns
df_pandas['bonus'] = df_pandas['salary'] * 0.1
df_pandas['total_comp'] = df_pandas['salary'] + df_pandas['bonus']

# Polars: Add new columns
df_polars = df_polars.with_columns([
    (pl.col('salary') * 0.1).alias('bonus'),
    (pl.col('salary') * 1.1).alias('total_comp')
])

print("Pandas with new columns:")
print(df_pandas)
print("\nPolars with new columns:")
print(df_polars)
```
Output:

```
Pandas with new columns:
    name  age  salary   bonus  total_comp
0   Anna   25   50000  5000.0     55000.0
1  Betty   30   60000  6000.0     66000.0
2  Cathy   35   70000  7000.0     77000.0

Polars with new columns:
shape: (3, 5)
┌───────┬─────┬────────┬────────┬────────────┐
│ name  ┆ age ┆ salary ┆ bonus  ┆ total_comp │
│ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ i64    ┆ f64    ┆ f64        │
╞═══════╪═════╪════════╪════════╪════════════╡
│ Anna  ┆ 25  ┆ 50000  ┆ 5000.0 ┆ 55000.0    │
│ Betty ┆ 30  ┆ 60000  ┆ 6000.0 ┆ 66000.0    │
│ Cathy ┆ 35  ┆ 70000  ┆ 7000.0 ┆ 77000.0    │
└───────┴─────┴────────┴────────┴────────────┘
```
What’s going on here:
- Pandas uses direct column assignment, which modifies the DataFrame in place
- Polars uses `.with_columns()` and returns a new DataFrame (immutable by default)
- In Polars, you use `.alias()` to name each new column

The Polars approach promotes immutability and makes data transformations more readable.
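If you like the immutable style but are staying on pandas, `.assign()` offers something similar: it returns a new DataFrame rather than mutating in place. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'salary': [50000, 60000, 70000]})

# .assign() returns a new DataFrame; df itself is untouched.
# Later keyword arguments can reference earlier ones via a lambda.
df_new = df.assign(
    bonus=df['salary'] * 0.1,
    total_comp=lambda d: d['salary'] + d['bonus'],
)

print(list(df.columns))      # original is unchanged
print(list(df_new.columns))  # new frame has the extra columns
```

Chaining `.assign()` calls gives pandas pipelines much of the readability that `.with_columns()` gives Polars ones.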
# Measuring Performance in Grouping and Aggregation
Let’s look at a more realistic example: grouping data and calculating multiple aggregations. This code groups the data by department, calculates several statistics on different columns, and times both operations to show the performance difference:
```python
import time

# Load our large dataset
df_pandas = pd.read_csv('large_dataset.csv')
df_polars = pl.read_csv('large_dataset.csv')

# pandas: Group by department and calculate stats
start = time.time()
result_pandas = df_pandas.groupby('department').agg({
    'salary': ['mean', 'median', 'std'],
    'age': 'mean'
}).reset_index()
result_pandas.columns = ['department', 'avg_salary', 'median_salary', 'std_salary', 'avg_age']
pandas_time = time.time() - start

# Polars: Same operation
start = time.time()
result_polars = df_polars.group_by('department').agg([
    pl.col('salary').mean().alias('avg_salary'),
    pl.col('salary').median().alias('median_salary'),
    pl.col('salary').std().alias('std_salary'),
    pl.col('age').mean().alias('avg_age')
])
polars_time = time.time() - start

print(f"Pandas time: {pandas_time:.3f}s")
print(f"Polars time: {polars_time:.3f}s")
print(f"Speedup: {pandas_time/polars_time:.1f}x")
print("\nPandas result:")
print(result_pandas)
print("\nPolars result:")
print(result_polars)
```
Output:

```
Pandas time: 0.126s
Polars time: 0.077s
Speedup: 1.6x

Pandas result:
    department    avg_salary  median_salary    std_salary    avg_age
0  Engineering  89954.929266        89919.0  34595.585863  48.953405
1      Finance  89898.829762        89817.0  34648.373383  49.006690
2           HR  90080.629637        90177.0  34692.117761  48.979005
3    Marketing  90071.721095        90154.0  34625.095386  49.085454
4        Sales  89980.433386        90065.5  34634.974505  49.003168

Polars result:
shape: (5, 5)
┌─────────────┬──────────────┬───────────────┬──────────────┬───────────┐
│ department  ┆ avg_salary   ┆ median_salary ┆ std_salary   ┆ avg_age   │
│ ---         ┆ ---          ┆ ---           ┆ ---          ┆ ---       │
│ str         ┆ f64          ┆ f64           ┆ f64          ┆ f64       │
╞═════════════╪══════════════╪═══════════════╪══════════════╪═══════════╡
│ HR          ┆ 90080.629637 ┆ 90177.0       ┆ 34692.117761 ┆ 48.979005 │
│ Sales       ┆ 89980.433386 ┆ 90065.5       ┆ 34634.974505 ┆ 49.003168 │
│ Engineering ┆ 89954.929266 ┆ 89919.0       ┆ 34595.585863 ┆ 48.953405 │
│ Marketing   ┆ 90071.721095 ┆ 90154.0       ┆ 34625.095386 ┆ 49.085454 │
│ Finance     ┆ 89898.829762 ┆ 89817.0       ┆ 34648.373383 ┆ 49.00669  │
└─────────────┴──────────────┴───────────────┴──────────────┴───────────┘
```
Breaking down the syntax:
- Pandas uses a dictionary to specify aggregations, which can get confusing for complex operations and produces multi-level column names that need renaming afterwards
- Polars uses method chaining: each aggregation is explicit and named

The Polars syntax is more functional and more readable: you can see at a glance which statistics are being calculated.
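On the pandas side, the dictionary-plus-rename dance can be avoided: "named aggregation" (available since pandas 0.25) lets you name each output column directly, reading much closer to the Polars version. A small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'HR'],
    'salary': [60000, 80000, 50000],
})

# Named aggregation: output_column=(input_column, function).
# No MultiIndex columns, no manual renaming needed.
result = df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).reset_index()

print(result)
```

With this form, the pandas and Polars versions of the benchmark above differ mostly in method names rather than in structure.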
# Understanding lazy evaluation in Polars
Lazy evaluation is one of Polars’ most useful features: it does not execute your query immediately. Instead, it builds a plan for the entire operation and optimizes it before running.
Let’s see this in action:
```python
import polars as pl
import time

# Read in lazy mode
df_lazy = pl.scan_csv('large_dataset.csv')

# Build a complex query
result = (
    df_lazy
    .filter(pl.col('age') > 30)
    .filter(pl.col('salary') > 50000)
    .group_by('department')
    .agg([
        pl.col('salary').mean().alias('avg_salary'),
        pl.len().alias('employee_count')
    ])
    .filter(pl.col('employee_count') > 1000)
    .sort('avg_salary', descending=True)
)

# Nothing has been executed yet!
print("Query plan created, but not executed")

# Now execute the optimized query
start = time.time()
result_df = result.collect()  # This runs the query
execution_time = time.time() - start

print(f"\nExecution time: {execution_time:.3f}s")
print(result_df)
```
Output:

```
Query plan created, but not executed

Execution time: 0.177s
shape: (5, 3)
┌─────────────┬───────────────┬────────────────┐
│ department  ┆ avg_salary    ┆ employee_count │
│ ---         ┆ ---           ┆ ---            │
│ str         ┆ f64           ┆ u32            │
╞═════════════╪═══════════════╪════════════════╡
│ HR          ┆ 100101.595816 ┆ 132212         │
│ Marketing   ┆ 100054.012365 ┆ 132470         │
│ Sales       ┆ 100041.01049  ┆ 132035         │
│ Finance     ┆ 99956.527217  ┆ 132143         │
│ Engineering ┆ 99946.725458  ┆ 132384         │
└─────────────┴───────────────┴────────────────┘
```
Here, scan_csv() does not load the file immediately; it only records a plan to read it. We then chain filters, a grouping, and a sort. Polars analyzes and optimizes the entire query: for example, it can push the filters down so that rows are discarded while the file is being read. Only when we call .collect() does the actual computation happen, and the optimized query runs faster than executing each step eagerly would.
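If lazy evaluation feels abstract, plain Python generators follow the same principle: each step describes a transformation without running it, and nothing executes until the result is consumed. This is a rough analogy only, not how Polars is implemented internally:

```python
# Each step only *describes* a transformation; no data moves yet.
rows = ({'age': a, 'salary': s} for a, s in [(25, 40000), (35, 60000), (45, 90000)])
adults = (r for r in rows if r['age'] > 30)   # like .filter()
salaries = (r['salary'] for r in adults)      # like a column .select()

# Only consuming the pipeline triggers execution - like .collect().
total = sum(salaries)
print(total)  # 150000, computed only at this point
```

The key difference is that Polars does not just defer work: it inspects the whole plan first and reorders or merges steps before executing anything.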
# Wrapping Up
As we have seen, Polars is extremely useful for data processing in Python. It is faster, more memory-efficient, and has a cleaner API than pandas. That said, pandas isn’t going anywhere: it has over a decade of development behind it, a huge ecosystem, and millions of users. For many projects, pandas is still the right choice.
Learn Polars if you are working on large-scale analysis, data engineering projects, and the like. The syntax difference is not huge, and the performance benefits are real. But keep pandas in your toolkit for compatibility and quick exploratory work.
Start by trying Polars on a side project or a slow-running data pipeline. You’ll quickly get a feel for whether it’s right for your use case. Happy data wrangling!
Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She loves reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
