# Introduction
Data analytics has changed in recent years. The traditional approach of loading everything into a relational database and running SQL queries still works, but it is overkill for many analytical workloads. Storing data in Parquet files and querying them directly with DuckDB is often faster, simpler, and more effective.
In this article, I will show you how to build a data analytics stack in Python that uses DuckDB to query data stored in Parquet files. We’ll work with a sample dataset, explore how each component works, and understand why this approach can be useful for your data science projects.
You can find the code on GitHub.
# Prerequisites
Before you begin, make sure you have:
- Python 3.10 or later installed
- Understanding of SQL basics and Pandas dataframe operations
- Familiarity with data analysis concepts
Also install the required libraries:
pip install duckdb pandas pyarrow numpy faker
# Understanding the Recommended Data Analytics Stack
Let’s start by understanding what each component does and why they work well together.
Parquet is a columnar storage format originally designed for the Hadoop ecosystem. Unlike row-based formats like CSV, where each row is a complete record, Parquet organizes data by columns. This may seem like a small difference, but it has a huge impact on analytics.
When you run a query that needs only three columns from a table with fifty columns, Parquet lets you read just those three columns. With CSV, you have to read every row completely and then throw away the 47 columns you don’t need. This makes Parquet much faster for typical analytical queries. Additionally, columnar storage compresses well because values in the same column tend to be similar.
DuckDB is an embedded analytical database. Whereas SQLite is optimized for transactional workloads with many small reads and writes, DuckDB is designed specifically for analytical queries that scan large amounts of data, aggregate, and join. The embedded part means it runs inside your Python process, so there’s no separate database server to set up or manage.
What makes DuckDB special for analytics is that it can query Parquet files directly. You do not need to import data into the database first. Point DuckDB to a Parquet file, write SQL, and it only reads what it needs. This “query in place” capability is what makes the entire stack useful.
These pieces fit together naturally in your Python development environment: you store the data in Parquet files, Pandas handles data manipulation, DuckDB executes analytical queries, and the entire Python ecosystem is available for visualization, machine learning, and automation.
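The pieces also interoperate smoothly. For example, DuckDB can query an in-memory Pandas dataframe by its variable name (a "replacement scan") and hand the result back as a dataframe, so you can move between SQL and Pandas freely. A minimal sketch:
import duckdb
import pandas as pd
# DuckDB resolves `df` from the local Python scope
df = pd.DataFrame({"segment": ["a", "b", "b"], "amount": [10, 20, 30]})
result = duckdb.sql("SELECT segment, SUM(amount) AS total FROM df GROUP BY segment").df()
print(result)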
# Creating a Sample Dataset
We will use an e-commerce dataset. You can use the data_generator.py script to generate the sample dataset, or follow along in the accompanying notebook.
The dataset includes customers who place orders, orders that contain multiple items, and products with categories and pricing.
The data maintains referential integrity: each order refers to a valid customer, and each order item refers to both a valid order and a valid product. This lets us write meaningful joins and aggregations.
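To give a sense of the shape of the data, here is a minimal sketch of what a generator script might contain; the real data_generator.py in the repo is more elaborate, and the sizes and distributions below are our guesses. The column names match the ones used in the queries later:
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(42)
n_customers, n_products, n_orders, n_items = 10_000, 500, 50_000, 120_000

customers_df = pd.DataFrame({
    "customer_id": range(n_customers),
    "customer_name": [fake.name() for _ in range(n_customers)],
    "customer_segment": rng.choice(["Standard", "Basic", "Premium"],
                                   size=n_customers, p=[0.5, 0.3, 0.2]),
})

products_df = pd.DataFrame({
    "product_id": range(n_products),
    "product_name": [f"Product {i}" for i in range(n_products)],
    "category": rng.choice(["Electronics", "Clothing", "Books", "Sports",
                            "Automotive"], size=n_products),
    "base_price": rng.uniform(10, 2000, size=n_products).round(2),
})
products_df["cost"] = (products_df["base_price"] * 0.6).round(2)

# Foreign keys are drawn from existing ids, which is what gives the
# dataset its referential integrity
orders_df = pd.DataFrame({
    "order_id": range(n_orders),
    "customer_id": rng.integers(0, n_customers, size=n_orders),
    "order_date": [fake.date_between("-1y", "today") for _ in range(n_orders)],
    "payment_status": rng.choice(["completed", "pending", "failed"],
                                 size=n_orders, p=[0.85, 0.1, 0.05]),
})

order_items_df = pd.DataFrame({
    "order_id": rng.integers(0, n_orders, size=n_items),
    "product_id": rng.integers(0, n_products, size=n_items),
    "quantity": rng.integers(1, 5, size=n_items),
})
order_items_df = order_items_df.merge(products_df[["product_id", "base_price"]], on="product_id")
order_items_df["unit_price"] = (order_items_df["base_price"]
                                * rng.uniform(0.8, 1.0, size=n_items)).round(2)
order_items_df["item_total"] = (order_items_df["unit_price"] * order_items_df["quantity"]).round(2)
order_items_df = order_items_df.drop(columns="base_price")

# Derive order_total from the items so the two tables agree
totals = order_items_df.groupby("order_id")["item_total"].sum()
orders_df["order_total"] = totals.reindex(orders_df["order_id"], fill_value=0).round(2).to_numpy()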
# Saving Data to Parquet Files
Before we save our data, let’s understand why Parquet is effective for analytics. We have already touched on the advantages of columnar formats like Parquet, but let’s look at them again, this time in more detail.
In a CSV file, data is stored row by row. If you have a million rows with 50 columns and you only want to analyze one column, you still have to read all 50 million values and then discard the columns you don’t need. That is wasted work.
Parquet, as we now know, stores data column by column. All values of a column are stored together. When you query a column, you read exactly that column and nothing else. For analytical queries, which typically touch a small number of columns, this is much faster.
Columnar storage also compresses better. Values in the same column are similar: they are typically all integers, all dates, or all drawn from the same small set of categories. Compression algorithms work much better on uniform data than on random data.
Let’s save our data as Parquet and see the benefits:
# Save tables as Parquet files
customers_df.to_parquet('customers.parquet', engine="pyarrow", compression='snappy')
products_df.to_parquet('products.parquet', engine="pyarrow", compression='snappy')
orders_df.to_parquet('orders.parquet', engine="pyarrow", compression='snappy')
order_items_df.to_parquet('order_items.parquet', engine="pyarrow", compression='snappy')
# Compare with CSV to see the difference
customers_df.to_csv('customers.csv', index=False)
orders_df.to_csv('orders.csv', index=False)
import os

def get_size_mb(filename):
    return os.path.getsize(filename) / (1024 * 1024)
print("Storage Comparison:")
print(f"customers.csv: {get_size_mb('customers.csv'):.2f} MB")
print(f"customers.parquet: {get_size_mb('customers.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('customers.parquet')/get_size_mb('customers.csv'))*100:.1f}%\n")
print(f"orders.csv: {get_size_mb('orders.csv'):.2f} MB")
print(f"orders.parquet: {get_size_mb('orders.parquet'):.2f} MB")
print(f"Savings: {(1 - get_size_mb('orders.parquet')/get_size_mb('orders.csv'))*100:.1f}%")
Output:
Storage Comparison:
customers.csv: 0.73 MB
customers.parquet: 0.38 MB
Savings: 48.5%
orders.csv: 3.01 MB
orders.parquet: 1.25 MB
Savings: 58.5%
These compression ratios are typical; Parquet usually compresses significantly better than CSV. The codec we are using here is Snappy, which prioritizes speed over maximum compression.
Note: Parquet supports other codecs, such as gzip, which compresses better but is slower, and zstd, which offers a good balance between compression ratio and speed.
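Trying a different codec is a one-keyword change (the output file names below are our own):
# gzip: smaller files, slower writes; zstd: a good middle ground
orders_df.to_parquet('orders_gzip.parquet', engine="pyarrow", compression='gzip')
orders_df.to_parquet('orders_zstd.parquet', engine="pyarrow", compression='zstd')
print(f"orders_gzip.parquet: {get_size_mb('orders_gzip.parquet'):.2f} MB")
print(f"orders_zstd.parquet: {get_size_mb('orders_zstd.parquet'):.2f} MB")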
# Querying Parquet files with DuckDB
Now comes the interesting part. We can query these Parquet files directly using SQL without loading them into the database first.
import duckdb
# Create a DuckDB connection
con = duckdb.connect(database=":memory:")
# Query the Parquet file directly
query = """
SELECT
    customer_segment,
    COUNT(*) as num_customers,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage
FROM 'customers.parquet'
GROUP BY customer_segment
ORDER BY num_customers DESC
"""
result = con.execute(query).fetchdf()
print("Customer Distribution:")
print(result)
Output:
Customer Distribution:
customer_segment num_customers percentage
0 Standard 5070 50.70
1 Basic 2887 28.87
2 Premium 2043 20.43
Notice the query syntax: FROM 'customers.parquet'. DuckDB reads the file directly. There is no import step, no CREATE TABLE statement, and no waiting for data to load. You write SQL, DuckDB figures out what it needs from the file, and returns the results.
In the traditional workflow, you would create a database, define a schema, import the data, create indexes, and only then run queries. With DuckDB and Parquet, you skip all of that. Under the hood, DuckDB reads the Parquet file’s metadata to understand the schema, then uses predicate pushdown to skip data that doesn’t match your WHERE clause. It reads only the columns your query actually uses. For large files, this makes queries much faster.
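You can see the metadata-driven schema inference for yourself: DESCRIBE returns column names and types without scanning the data, and glob patterns let a single query span many files (the glob path below is hypothetical):
# Schema comes straight from the Parquet metadata; no data scan needed
print(con.execute("DESCRIBE SELECT * FROM 'customers.parquet'").fetchdf())

# Globs also work, so a folder of batch files queries like one table:
# con.execute("SELECT COUNT(*) FROM 'data/orders_*.parquet'").fetchdf()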
# Performing Complex Analytics
Let’s run a slightly more complex analytical query. We will analyze monthly revenue trends based on customer segment.
query = """
SELECT
    strftime(o.order_date, '%Y-%m') as month,
    c.customer_segment,
    COUNT(DISTINCT o.order_id) as num_orders,
    COUNT(DISTINCT o.customer_id) as unique_customers,
    ROUND(SUM(o.order_total), 2) as total_revenue,
    ROUND(AVG(o.order_total), 2) as avg_order_value
FROM 'orders.parquet' AS o
JOIN 'customers.parquet' AS c
    ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY month, c.customer_segment
ORDER BY month DESC, total_revenue DESC
LIMIT 15
"""
monthly_revenue = con.execute(query).fetchdf()
print("Recent Monthly Revenue by Segment:")
print(monthly_revenue.to_string(index=False))
Output:
Recent Monthly Revenue by Segment:
month customer_segment num_orders unique_customers total_revenue avg_order_value
2026-01 Standard 2600 1468 1683223.68 647.39
2026-01 Basic 1585 857 1031126.44 650.55
2026-01 Premium 970 560 914105.61 942.38
2025-12 Standard 2254 1571 1533076.22 680.16
2025-12 Premium 885 613 921775.85 1041.55
2025-12 Basic 1297 876 889270.86 685.64
2025-11 Standard 1795 1359 1241006.08 691.37
2025-11 Premium 725 554 717625.75 989.83
2025-11 Basic 1012 767 682270.44 674.18
2025-10 Standard 1646 1296 1118400.61 679.47
2025-10 Premium 702 550 695913.24 991.33
2025-10 Basic 988 769 688428.86 696.79
2025-09 Standard 1446 1181 970017.17 670.83
2025-09 Premium 594 485 577486.81 972.20
2025-09 Basic 750 618 495726.69 660.97
This query groups by two dimensions (month and segment), aggregates multiple metrics, and filters on payment status. It is the type of query you’ll write frequently in analytical work. The strftime function formats dates directly in SQL, and the ROUND function trims decimal places so the output stays readable. Multiple aggregations run efficiently together and produce the expected results.
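A small variation worth knowing (not used in the query above): DuckDB’s date_trunc keeps the month as a proper date rather than a string, which sorts and plots more naturally:
monthly = con.execute("""
SELECT
    date_trunc('month', order_date) AS month,
    ROUND(SUM(order_total), 2) AS revenue
FROM 'orders.parquet'
WHERE payment_status = 'completed'
GROUP BY 1
ORDER BY 1
""").fetchdf()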
# Joining Multiple Tables
Real analysis rarely involves a single table. Let’s join our tables to answer a business question: which product categories generate the most revenue, and how does this vary by customer segment?
query = """
SELECT
    p.category,
    c.customer_segment,
    COUNT(DISTINCT oi.order_id) as num_orders,
    SUM(oi.quantity) as units_sold,
    ROUND(SUM(oi.item_total), 2) as total_revenue,
    ROUND(AVG(oi.item_total), 2) as avg_item_value
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
JOIN 'customers.parquet' c ON o.customer_id = c.customer_id
WHERE o.payment_status = 'completed'
GROUP BY p.category, c.customer_segment
ORDER BY total_revenue DESC
LIMIT 20
"""
category_analysis = con.execute(query).fetchdf()
print("Revenue by Category and Customer Segment:")
print(category_analysis.to_string(index=False))
Output (truncated):
Revenue by Category and Customer Segment:
category customer_segment num_orders units_sold total_revenue avg_item_value
Electronics Standard 4729 6431.0 6638814.75 1299.18
Electronics Premium 2597 3723.0 3816429.62 1292.39
Electronics Basic 2685 3566.0 3585652.92 1240.28
Automotive Standard 4506 5926.0 3050679.12 633.18
Sports Standard 5049 6898.0 2745487.54 497.55
...
...
Clothing Premium 3028 4342.0 400704.25 114.55
Clothing Basic 3102 4285.0 400391.18 117.49
Books Standard 6196 8511.0 252357.39 36.74
This query joins four tables. DuckDB automatically determines the optimal join order and execution strategy. Note how much more readable the SQL is compared to the equivalent Pandas code. For complex analytical logic, SQL often expresses intent more clearly than dataframe operations.
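If you are curious how DuckDB chose to execute the joins, EXPLAIN prints the physical plan, including the join order and which Parquet columns are actually read:
# `query` still holds the category analysis query from above
plan = con.execute("EXPLAIN " + query).fetchall()
print(plan[0][1])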
# Understanding query performance
Let’s compare DuckDB with Pandas for a common analytical task.
# Method 1: Using Pandas
import time
# Analytical task: Calculate customer purchase patterns
print("Performance Comparison: Customer Purchase Analysis\n")
start_time = time.time()
# Merge dataframes
merged = order_items_df.merge(orders_df, on='order_id')
merged = merged.merge(products_df, on='product_id')
# Filter completed orders
completed = merged[merged['payment_status'] == 'completed']
# Group and aggregate
customer_patterns = completed.groupby('customer_id').agg({
    'order_id': 'nunique',
    'product_id': 'nunique',
    'item_total': ['sum', 'mean'],
    'category': lambda x: x.mode()[0] if len(x) > 0 else None
})
customer_patterns.columns = ['num_orders', 'unique_products', 'total_spent', 'avg_spent', 'favorite_category']
customer_patterns = customer_patterns.sort_values('total_spent', ascending=False).head(100)
pandas_time = time.time() - start_time
# Method 2: Using DuckDB
start_time = time.time()
query = """
SELECT
    o.customer_id,
    COUNT(DISTINCT oi.order_id) as num_orders,
    COUNT(DISTINCT oi.product_id) as unique_products,
    ROUND(SUM(oi.item_total), 2) as total_spent,
    ROUND(AVG(oi.item_total), 2) as avg_spent,
    MODE(p.category) as favorite_category
FROM 'order_items.parquet' oi
JOIN 'orders.parquet' o ON oi.order_id = o.order_id
JOIN 'products.parquet' p ON oi.product_id = p.product_id
WHERE o.payment_status = 'completed'
GROUP BY o.customer_id
ORDER BY total_spent DESC
LIMIT 100
"""
duckdb_result = con.execute(query).fetchdf()
duckdb_time = time.time() - start_time
print(f"Pandas execution time: {pandas_time:.4f} seconds")
print(f"DuckDB execution time: {duckdb_time:.4f} seconds")
print(f"Speedup: {pandas_time/duckdb_time:.1f}x faster with DuckDB\n")
print("Top 5 customers by total spent:")
print(duckdb_result.head().to_string(index=False))
Output:
Performance Comparison: Customer Purchase Analysis
Pandas execution time: 1.9872 seconds
DuckDB execution time: 0.1171 seconds
Speedup: 17.0x faster with DuckDB
Top 5 customers by total spent:
customer_id num_orders unique_products total_spent avg_spent favorite_category
8747 8 24 21103.21 879.30 Electronics
617 9 27 19596.22 725.79 Electronics
2579 9 18 17011.30 895.33 Sports
6242 7 23 16781.11 729.61 Electronics
5443 8 22 16697.02 758.96 Automotive
DuckDB is about 17 times faster here, and the gap grows with larger datasets. The Pandas approach loads all the data into memory, performs several merge operations (each of which creates a copy), and then aggregates. DuckDB reads directly from the Parquet files, pushes filters down to avoid reading unnecessary data, and uses optimized join algorithms.
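Another consequence of working file-to-file: DuckDB can write query results straight back to Parquet with COPY, without routing them through Pandas at all (the output file name below is our choice):
con.execute("""
COPY (
    SELECT o.customer_id, ROUND(SUM(oi.item_total), 2) AS total_spent
    FROM 'order_items.parquet' oi
    JOIN 'orders.parquet' o ON oi.order_id = o.order_id
    WHERE o.payment_status = 'completed'
    GROUP BY o.customer_id
) TO 'customer_spend.parquet' (FORMAT PARQUET)
""")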
# Building reusable analytics queries
In production analytics, you’ll run the same query over and over again with different parameters. Let’s create a reusable function that follows best practices for this workflow.
def analyze_product_performance(con, category=None, min_revenue=None, date_from=None, top_n=20):
    """
    Analyze product performance with flexible filtering.

    This demonstrates how to build reusable analytical queries that can be
    parameterized for different use cases. In production, you'd build a library
    of these functions for common analytical questions.
    """
    # Build the WHERE clause dynamically based on parameters
    where_clauses = ["o.payment_status = 'completed'"]

    if category:
        where_clauses.append(f"p.category = '{category}'")

    if date_from:
        where_clauses.append(f"o.order_date >= '{date_from}'")

    where_clause = " AND ".join(where_clauses)

    # Main analytical query
    query = f"""
    WITH product_metrics AS (
        SELECT
            p.product_id,
            p.product_name,
            p.category,
            p.base_price,
            COUNT(DISTINCT oi.order_id) as times_ordered,
            SUM(oi.quantity) as units_sold,
            ROUND(SUM(oi.item_total), 2) as total_revenue,
            ROUND(AVG(oi.unit_price), 2) as avg_selling_price,
            ROUND(SUM(oi.item_total) - (p.cost * SUM(oi.quantity)), 2) as profit
        FROM 'order_items.parquet' oi
        JOIN 'orders.parquet' o ON oi.order_id = o.order_id
        JOIN 'products.parquet' p ON oi.product_id = p.product_id
        WHERE {where_clause}
        GROUP BY p.product_id, p.product_name, p.category, p.base_price, p.cost
    )
    SELECT
        *,
        ROUND(100.0 * profit / total_revenue, 2) as profit_margin_pct,
        ROUND(avg_selling_price / base_price, 2) as price_realization
    FROM product_metrics
    """

    # Add revenue filter if specified
    if min_revenue:
        query += f" WHERE total_revenue >= {min_revenue}"

    query += f"""
    ORDER BY total_revenue DESC
    LIMIT {top_n}
    """

    return con.execute(query).fetchdf()
This function does a few things. First, it builds SQL dynamically based on its parameters, allowing flexible filtering without a separate query for each case. Second, it uses a common table expression (CTE) to organize complex logic into readable steps. Third, it computes derived metrics such as profit margin and price realization that require multiple source columns.
The profit calculation subtracts cost from revenue using data from both the order items and products tables. This kind of cross-table calculation is straightforward in SQL but would be cumbersome as a chain of Pandas operations. DuckDB handles it efficiently in a single query.
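One caution about building SQL with f-strings, as the function above does: interpolated values should come from trusted code, not end users. For untrusted input, DuckDB also supports ? placeholders, which keep values out of the SQL text entirely. A minimal sketch:
count = con.execute(
    "SELECT COUNT(*) FROM 'orders.parquet' WHERE payment_status = ?",
    ["completed"],
).fetchone()[0]
print(count)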
Here is an example that uses the above function:
# Example 1: Top electronics products
electronics = analyze_product_performance(con, category='Electronics', top_n=10)
print("Top 10 Electronics Products:")
print(electronics[['product_name', 'units_sold', 'total_revenue', 'profit_margin_pct']].to_string(index=False))
Output:
Top 10 Electronics Products:
product_name units_sold total_revenue profit_margin_pct
Electronics Item 113 262.0 510331.81 38.57
Electronics Item 154 289.0 486307.74 38.28
Electronics Item 122 229.0 448680.64 38.88
Electronics Item 472 251.0 444680.20 38.51
Electronics Item 368 222.0 424057.14 38.96
Electronics Item 241 219.0 407648.10 38.75
Electronics Item 410 243.0 400078.65 38.31
Electronics Item 104 233.0 400036.84 38.73
Electronics Item 2 213.0 382583.85 38.76
Electronics Item 341 240.0 376722.94 38.94
And here’s another example:
# Example 2: High-revenue products across all categories
print("\n\nHigh-Revenue Products (>$50k revenue):")
high_revenue = analyze_product_performance(con, min_revenue=50000, top_n=10)
print(high_revenue[['product_name', 'category', 'total_revenue', 'profit']].to_string(index=False))
Output:
High-Revenue Products (>$50k revenue):
product_name category total_revenue profit
Electronics Item 113 Electronics 510331.81 196846.19
Electronics Item 154 Electronics 486307.74 186140.78
Electronics Item 122 Electronics 448680.64 174439.40
Electronics Item 472 Electronics 444680.20 171240.80
Electronics Item 368 Electronics 424057.14 165194.04
Electronics Item 241 Electronics 407648.10 157955.25
Electronics Item 410 Electronics 400078.65 153270.84
Electronics Item 104 Electronics 400036.84 154953.46
Electronics Item 2 Electronics 382583.85 148305.15
Electronics Item 341 Electronics 376722.94 146682.94
# Wrapping Up
In this article, we built an analytics workflow around e-commerce data: we generated relational data, stored it as Parquet, and queried it with DuckDB. The performance comparison showed a substantial speedup over the traditional Pandas approach.
Use this stack when you are running analytical workloads over structured data: aggregating, filtering, joining, and computing metrics. It works well for data that changes in batches rather than continuously. If you’re analyzing yesterday’s sales, producing monthly reports, or exploring historical trends, periodically refreshed Parquet files work great. You don’t need a live database accepting continuous writes.
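For batch workflows like these, one convenient pattern (the file and view names are our own) is a persistent DuckDB database holding views over the Parquet files, so repeated analyses don’t have to re-type file paths:
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE OR REPLACE VIEW orders AS SELECT * FROM 'orders.parquet'")
con.execute("CREATE OR REPLACE VIEW customers AS SELECT * FROM 'customers.parquet'")
print(con.execute("SELECT COUNT(*) FROM orders").fetchone())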
However, this stack is not suitable for everything:
- If you need real-time updates with many concurrent writers, you need a traditional database with ACID transactions.
- If you are building an application with user-facing queries that require millisecond response times, an indexed database is better.
- If multiple users need to run queries simultaneously with different access permissions, a database server provides better control.
The sweet spot is analytical work on large datasets, where data updates arrive in batches and you need fast, flexible queries and analysis.
Happy analysis!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
