How to handle large datasets in Python even if you are a newbie

Image by author

Introduction

Working with large datasets in Python often raises a common problem: you load your data with pandas, and your program slows to a crawl or crashes completely. This usually happens because you are trying to load everything into memory at once.

Most memory problems arise from how you load and process the data. With some practical techniques, you can handle datasets much larger than your available memory.

In this article, you will learn seven techniques for working efficiently with large datasets in Python. We’ll start simple and work our way up, so by the end, you’ll know exactly which approach fits your use case.

🔗 You can find the code on GitHub, along with a sample data generator Python script you can run to create CSV files for trying out these snippets.

1. Read the data in chunks

The most beginner-friendly approach is to process your data in smaller chunks rather than loading everything at once.

Consider a scenario where you have a large sales dataset and you want to find the total revenue. The following code demonstrates this approach:

import pandas as pd

# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0

# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()

print(f"Total Revenue: ${total_revenue:,.2f}")

Instead of loading all 10 million rows at once, we load 100,000 rows at a time, calculate the sum for each chunk, and add it to our running total. Only 100,000 rows are ever in RAM, no matter how big the file is.

When to use it: When you need to perform aggregation (sum, count, average) or filtering operations on large files.

2. Use only specific columns

Often, you don’t need every column in your dataset. Loading only what you need can significantly reduce memory usage.

Let’s say you’re analyzing customer data, but you only need age and purchase amount instead of many other columns:

import pandas as pd

# Only load the columns you actually need
columns_to_use = ['customer_id', 'age', 'purchase_amount']

df = pd.read_csv('customers.csv', usecols=columns_to_use)

# Now work with a much lighter dataframe
average_purchase = df.groupby('age')['purchase_amount'].mean()
print(average_purchase)

By specifying usecols, pandas loads only those three columns into memory. If your original file had 50 columns, you have cut your memory usage by roughly 94%.

When to use it: When you know which columns you need before loading the data.

3. Optimize data types

By default, pandas may use more memory than necessary. A column of integers is stored as 64-bit even when 8-bit would suffice.

For example, if you are loading a dataset with product ratings (1-5 stars) and user IDs:

import pandas as pd

# First, let's see the default memory usage
df = pd.read_csv('ratings.csv')
print("Default memory usage:")
print(df.memory_usage(deep=True))

# Now optimize the data types
df['rating'] = df['rating'].astype('int8')    # Ratings are 1-5, so int8 is enough
df['user_id'] = df['user_id'].astype('int32')  # Assuming user IDs fit in int32

print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))

By changing the rating column from the default int64 (8 bytes per number) to int8 (1 byte per number), we get an 8x memory reduction for that column.

Common conversions include:

  • int64 → int8, int16, or int32 (depending on the range of values).
  • float64 → float32 (if you don’t need extreme precision).
  • object → category (for columns with repeated values).
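Rather than converting after loading, you can also pass the same dtype mapping directly to read_csv, so the 64-bit defaults never occupy memory in the first place. A minimal sketch, using a small inline CSV in place of a real ratings file (the data here is made up for illustration):

```python
import io
import pandas as pd

# A tiny inline CSV stands in for a real ratings file on disk.
csv_data = io.StringIO(
    "user_id,rating\n"
    "1001,5\n"
    "1002,3\n"
    "1003,4\n"
)

# Passing a dtype mapping to read_csv applies the narrower types at load
# time, so the wide defaults are never materialized.
df = pd.read_csv(csv_data, dtype={"rating": "int8", "user_id": "int32"})

print(df.dtypes)
```

For a real file, you would pass the file path instead of the StringIO object; the dtype argument works the same way.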

4. Use categorical data types

When a column contains repeated text values (such as country names or product categories), pandas stores each value separately. The category dtype stores each unique value once and uses compact integer codes to reference them.

Let’s say you’re working with a product inventory file where the Category column has only 20 unique values, but they are repeated across all rows in the dataset:

import pandas as pd

df = pd.read_csv('products.csv')

# Check memory before conversion
print(f"Before: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# Convert to category
df['category'] = df['category'].astype('category')

# Check memory after conversion
print(f"After: {df['category'].memory_usage(deep=True) / 1024**2:.2f} MB")

# It still works like normal text
print(df['category'].value_counts())

This transformation can substantially reduce memory usage for columns with low cardinality (few unique values). Columns still function the same as standard text data: you can filter, group, and sort as usual.
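If you already know a column has low cardinality, you can ask read_csv to load it as categorical from the start via the dtype parameter, so the repeated strings are never stored individually. A short sketch with made-up inventory data:

```python
import io
import pandas as pd

# Hypothetical inventory data with a low-cardinality 'category' column.
csv_data = io.StringIO(
    "product_id,category\n"
    "1,Electronics\n"
    "2,Grocery\n"
    "3,Electronics\n"
    "4,Grocery\n"
)

# dtype='category' stores the column as categorical at load time.
df = pd.read_csv(csv_data, dtype={"category": "category"})

print(df["category"].dtype)                  # category
print(sorted(df["category"].cat.categories))  # the unique values, stored once
```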

When to use it: For any text column where values are repeated (categories, states, countries, departments, and so on).

5. Filter while reading

Sometimes you know you only need a subset of rows. Instead of loading everything and then filtering, you can filter during the load process.

For example, if you only care about transactions from the year 2024:

import pandas as pd

# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []

for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)

# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)

print(f"Loaded {len(df_2024)} rows from 2024")

We are combining chunking with filtering. Each segment is filtered before being added to our list, so we never keep the entire dataset in memory, only the rows we actually want.

When to use it: When you need only a subset of rows based on a condition.

6. Use Dask for parallel processing

For datasets that are really huge, Dask provides a pandas-like API but handles all the chunking and parallel processing automatically.

Here’s how you would calculate the average of a column in a huge dataset:

import dask.dataframe as dd

# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')

# Operations look just like pandas
result = df['sales'].mean()

# Dask is lazy - compute() actually executes the calculation
average_sales = result.compute()

print(f"Average Sales: ${average_sales:,.2f}")

Dask does not load the entire file into memory. Instead, it builds a plan to process the data in pieces and executes that plan when you call .compute(). It can also use multiple CPU cores to speed up calculations.

When to use it: When your dataset is too large for pandas, even with chunking, or when you want parallel processing without writing complex code.

7. Sample your data for exploration

When you’re just exploring or testing code, you don’t need the entire dataset. Load a sample first.

Let’s say you’re building a machine learning model and want to test your preprocessing pipeline. You can sample your dataset as shown:

import pandas as pd

# Read just the first 50,000 rows
df_sample = pd.read_csv('huge_dataset.csv', nrows=50000)

# Or read a random sample using skiprows
import random
skip_rows = lambda x: x > 0 and random.random() > 0.01  # Keep ~1% of rows

df_random_sample = pd.read_csv('huge_dataset.csv', skiprows=skip_rows)

print(f"Sample size: {len(df_random_sample)} rows")

The first approach loads the first N rows, which works for quick exploration. The second approach randomly samples rows across the entire file, which is better for statistical analysis or when the file is sorted in a way that makes the first rows unrepresentative.
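Note that the lambda approach gives an approximate sample size (~1% is a probability, not a guarantee). If you need an exact sample size, one option is a two-pass sketch: count the rows, pick exact row numbers to keep, then use skiprows to drop everything else. The in-memory CSV below stands in for a real file:

```python
import io
import random
import pandas as pd

# Hypothetical CSV standing in for a large file on disk: 1,000 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 1001))

# Pass 1: count data rows (for a file on disk, you could count lines
# cheaply without parsing them into a dataframe).
n_rows = csv_text.count("\n")  # 1000 data rows here (header excluded)

# Pass 2: choose exactly 50 random row numbers to KEEP, skip the rest.
# Row 0 is the header, so data rows are numbered 1..n_rows.
random.seed(42)
keep = set(random.sample(range(1, n_rows + 1), k=50))

df_sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and i not in keep,
)

print(len(df_sample))  # exactly 50 rows
```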

When to use it: During development, testing, or exploratory analysis before running your code on the full dataset.

Conclusion

Handling large datasets does not require expert-level skills. Here’s a quick summary of the techniques we discussed:

  • Chunking: aggregating, filtering, and processing data you can’t fit into RAM.
  • Column selection: when you need only a few columns from a wide dataset.
  • Data type optimization: always; do this after loading to save memory.
  • Categorical types: for text columns with repeated values (categories, states, etc.).
  • Filter while reading: when you only need a subset of rows.
  • Dask: for very large datasets or when you want parallel processing.
  • Sampling: during development and exploration.

The first step is to know both your data and your task. Most of the time, a combination of chunking and smart column selection will get you 90% of the way there.
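That combination can all happen in a single read_csv call: chunking, column selection, narrow dtypes, and filtering at once. A sketch with hypothetical transaction data (the column names and inline CSV are made up for illustration):

```python
import io
import pandas as pd

# An inline CSV keeps the sketch runnable; use a file path in practice.
csv_data = io.StringIO(
    "txn_id,year,region,amount,notes\n"
    "1,2023,EU,10.0,a\n"
    "2,2024,EU,20.0,b\n"
    "3,2024,US,30.0,c\n"
    "4,2024,US,40.0,d\n"
)

total = 0.0
# Chunking + column selection + narrow dtypes + filtering, in one read.
for chunk in pd.read_csv(
    csv_data,
    usecols=["year", "amount"],                    # only the columns we need
    dtype={"year": "int16", "amount": "float32"},  # narrow types at load time
    chunksize=2,                                   # tiny chunks for the demo
):
    total += chunk.loc[chunk["year"] == 2024, "amount"].sum()

print(f"2024 revenue: {total:.2f}")  # 90.00
```

On a real multi-gigabyte file you would raise chunksize to something like 100,000; the rest of the pattern stays the same.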

As your needs grow, move up to a more advanced tool like Dask, or consider converting your data to more efficient file formats such as Parquet or HDF5.

Now go ahead and start working with those huge datasets. Happy analyzing!

Bala Priya C is a developer and technical writer from India. She likes to work at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She loves reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
