5 lightweight alternatives to Pandas you should try

by Kanwal Mehreen

Image by author

Introduction

Developers use Pandas for data manipulation, but it can be slow, especially with large datasets. Because of this, many people are looking for faster and lighter alternatives. These options keep the core features required for analysis while focusing on speed, low memory usage, and simplicity. In this article, we will look at five lightweight alternatives to Pandas that you can try.

1. DuckDB

DuckDB is like SQLite for analytics. You can run SQL queries directly on comma-separated values (CSV) files, which is useful if you already know SQL or work with machine learning pipelines. Install it with:

pip install duckdb

We will use the Titanic dataset and run a simple SQL query on it like this:

import duckdb

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"

# Run SQL query on the CSV
result = duckdb.query(f"""
    SELECT sex, age, survived
    FROM read_csv_auto('{url}')
    WHERE age > 18
""").to_df()

print(result.head())

Output:


      sex     age   survived
0     male    22.0          0
1   female    38.0          1
2   female    26.0          1
3   female    35.0          1
4     male    35.0          0

DuckDB runs SQL queries directly on the CSV file and then converts the output into a DataFrame. You get SQL speed with Python flexibility.
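
Because the result comes back as a regular DataFrame, you can also push heavier work, such as aggregations, into SQL and only convert the small result. A minimal sketch along those lines (the survival-rate query is my own illustration, not from the original example):

import duckdb

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"

# Aggregate inside DuckDB, then convert only the small result to a DataFrame
rates = duckdb.query(f"""
    SELECT sex, AVG(survived) AS survival_rate
    FROM read_csv_auto('{url}')
    GROUP BY sex
""").to_df()

print(rates)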

2. Polars

Polars is one of the most popular data libraries available today. It is implemented in Rust, which makes it exceptionally fast with minimal memory requirements. The syntax is also very clear. Let’s install it using pip:

pip install polars

Now, let’s use the Titanic dataset to cover a simple example:

import polars as pl

# Load dataset 
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pl.read_csv(url)

result = df.filter(pl.col("age") > 40).select(["sex", "age", "survived"])
print(result)

Output:


shape: (150, 3)
┌────────┬──────┬──────────┐
│ sex    ┆ age  ┆ survived │
│ ---    ┆ ---  ┆ ---      │
│ str    ┆ f64  ┆ i64      │
╞════════╪══════╪══════════╡
│ male   ┆ 54.0 ┆ 0        │
│ female ┆ 58.0 ┆ 1        │
│ female ┆ 55.0 ┆ 1        │
│ male   ┆ 66.0 ┆ 0        │
│ male   ┆ 42.0 ┆ 0        │
│ …      ┆ …    ┆ …        │
│ female ┆ 48.0 ┆ 1        │
│ female ┆ 42.0 ┆ 1        │
│ female ┆ 47.0 ┆ 1        │
│ male   ┆ 47.0 ┆ 0        │
│ female ┆ 56.0 ┆ 1        │
└────────┴──────┴──────────┘

Polars reads the CSV, filters rows on the age condition, and selects a subset of columns.
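
Polars also has a lazy API that builds a query plan and optimizes it before anything executes. A minimal sketch, as my own illustration (on Polars versions before 0.19 the method is spelled groupby rather than group_by):

import polars as pl

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"

# Nothing executes until collect(); Polars can optimize the whole plan first
result = (
    pl.read_csv(url)
    .lazy()
    .group_by("sex")
    .agg(pl.col("survived").mean().alias("survival_rate"))
    .collect()
)
print(result)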

3. PyArrow

PyArrow is a lightweight library for columnar data. Like Polars, it builds on Apache Arrow for speed and memory efficiency. It is not a full alternative to Pandas but is excellent for reading and preprocessing files. Install it with:

pip install pyarrow

For our example, let’s use the Iris dataset in CSV form as follows:

import pyarrow.csv as csv
import pyarrow.compute as pc
import urllib.request

# Download the Iris CSV 
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
local_file = "iris.csv"
urllib.request.urlretrieve(url, local_file)

# Read with PyArrow
table = csv.read_csv(local_file)

# Filter rows
filtered = table.filter(pc.greater(table["sepal_length"], 5.0))

print(filtered.slice(0, 5))

Output:


pyarrow.Table
sepal_length: double
sepal_width: double
petal_length: double
petal_width: double
species: string
----
sepal_length: [[5.1,5.4,5.4,5.8,5.7]]
sepal_width: [[3.5,3.9,3.7,4,4.4]]
petal_length: [[1.4,1.7,1.5,1.2,1.5]]
petal_width: [[0.2,0.4,0.2,0.2,0.4]]
species: [["setosa","setosa","setosa","setosa","setosa"]]

PyArrow reads the CSV and stores it in a columnar format. Each column’s name and type are listed in a clear schema. This setup makes it fast to inspect and filter large datasets.
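
The columnar layout also lets you aggregate without ever building a DataFrame, handing the data to Pandas only when you need its API. A minimal sketch, assuming the same iris.csv file downloaded above:

import pyarrow.csv as csv
import pyarrow.compute as pc

table = csv.read_csv("iris.csv")

# Aggregate directly on the Arrow column, no DataFrame needed
print(pc.mean(table["sepal_length"]).as_py())

# Convert to Pandas only at the end, when you want its API
df = table.to_pandas()
print(df.shape)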

4. Modin

Modin is for those who want fast performance without learning a new library. It uses the same Pandas API but runs operations in parallel. You don’t need to change your existing code; just update the import, and everything else works like normal Pandas. Install it with pip:

pip install "modin[all]"

For a better understanding, let’s try a small example using the same Titanic dataset:

import modin.pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"

# Load the dataset
df = pd.read_csv(url)

# Filter the dataset 
adults = df[df["age"] > 18]

# Select only a few columns to display
adults_small = adults[["survived", "sex", "age", "class"]]

# Display result
adults_small.head()

Output:


   survived     sex   age   class
0         0    male  22.0   Third
1         1  female  38.0   First
2         1  female  26.0   Third
3         1  female  35.0   First
4         0    male  35.0   Third

Modin spreads the work across CPU cores, which means you get better performance without doing anything extra.
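
To see how little changes, here is a group-by you might write in plain Pandas; only the import line differs. A minimal sketch (the aggregation itself is my own illustration):

import modin.pandas as pd  # the only change from plain Pandas

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# Same Pandas API; Modin parallelizes the work across cores
print(df.groupby("sex")["survived"].mean())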

5. Dask

How do you handle big data without adding RAM? Dask is a great option when your files are larger than your computer’s random access memory (RAM). It uses lazy evaluation, so it does not load the entire dataset into memory, which helps you process millions of rows smoothly. Install it with:

pip install "dask[complete]"

To try this, we can use the Chicago crime dataset as follows:

import dask.dataframe as dd
import urllib.request

url = "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
local_file = "chicago_crime.csv"
urllib.request.urlretrieve(url, local_file)

# Read CSV with Dask (lazy evaluation)
df = dd.read_csv(local_file, dtype=str)  # all columns as string

# Filter crimes classified as 'THEFT'
thefts = df[df["Primary Type"] == "THEFT"]

# Select a few relevant columns
thefts_small = thefts[["ID", "Date", "Primary Type", "Description", "District"]]

print(thefts_small.head())

Output:


          ID                   Date Primary Type       Description District            
5   13204489 09/06/2023 11:00:00 AM        THEFT         OVER $500      001
50  13179181 08/17/2023 03:15:00 PM        THEFT      RETAIL THEFT      014
51  13179344 08/17/2023 07:25:00 PM        THEFT      RETAIL THEFT      014
53  13181885 08/20/2023 06:00:00 AM        THEFT    $500 AND UNDER      025
56  13184491 08/22/2023 11:44:00 AM        THEFT      RETAIL THEFT      014

Filtering on Primary Type == 'THEFT' and selecting columns are lazy operations: Dask only builds a task graph, and nothing runs until head() asks for results. Even then, Dask processes the data in chunks rather than loading everything at once.
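
You can chain further lazy steps and trigger the whole graph with a single compute() call at the end. A minimal sketch, assuming the same chicago_crime.csv file downloaded above (the per-district count is my own illustration):

import dask.dataframe as dd

df = dd.read_csv("chicago_crime.csv", dtype=str)
thefts = df[df["Primary Type"] == "THEFT"]

# Still lazy: this only extends the task graph
counts = thefts.groupby("District").size()

# compute() runs the graph chunk by chunk and returns a Pandas Series
print(counts.compute().sort_values(ascending=False).head())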

Conclusion

We covered five lightweight alternatives to Pandas and how to use them, keeping things simple and focused. See the official documentation for each library for full details.

If you encounter any problems, leave a comment and I will help.

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for the intersection of AI, data science, and medicine. She co-authored the eBook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she is an advocate for diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a strong advocate for change, having founded FEMCodes to empower women in STEM fields.
