
Image by author
# Introduction
Hugging Face Datasets provides one of the simplest ways to load a dataset, often with a single line of code. These datasets are commonly available in formats such as CSV, Parquet, and Arrow. While all three are designed to store tabular data, they work very differently under the hood. The choice of format determines how the data is stored, how quickly it can be loaded, how much storage space is required, and how faithfully data types are preserved. As datasets grow larger and models become more complex, these differences matter more. In this article, we'll look at how Hugging Face Datasets works with CSV, Parquet, and Arrow, what exactly makes them different on disk and in memory, and when it makes sense to use each. Let's get started.
# 1. CSV
CSV stands for Comma-Separated Values. It's plain text: one row per line, with columns separated by commas (or tabs, in the TSV variant). Almost every tool can open it, e.g. Excel, Google Sheets, pandas, and databases. It is simple and highly interoperable.
Example:
name,age,city
Kanwal,30,New York
Qasim,25,Edmonton
Hugging Face treats it as a row-based format, which means data is read row by row. This is fine for small datasets, but performance degrades as they grow. It also has other limitations:
- No clear schema: Since all data is stored as text, types must be inferred every time the file is loaded. Inconsistent data can lead to errors.
- Large size and slow I/O: Storing text increases file size, and parsing numbers from text is CPU-intensive.
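The "no schema" limitation is easy to see with a stdlib-only sketch: Python's built-in csv module hands every field back as a string, so the caller must re-infer numeric types on every load (the sample data matches the table above):

```python
import csv
import io

data = "name,age,city\nKanwal,30,New York\nQasim,25,Edmonton\n"

rows = list(csv.DictReader(io.StringIO(data)))

# Every value comes back as a string -- '30' here is text, not an integer.
print(rows[0])                # {'name': 'Kanwal', 'age': '30', 'city': 'New York'}
print(type(rows[0]["age"]))   # <class 'str'>

# Callers must convert types themselves, every time the file is read.
ages = [int(r["age"]) for r in rows]
print(ages)                   # [30, 25]
```

Any loader built on CSV, including Hugging Face's, has to do this kind of inference (or be told the types explicitly), which is exactly the per-load cost the formats below avoid.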
# 2. Parquet
Parquet is a binary columnar format. Instead of writing rows one after another like CSV, Parquet groups values by column. When you only need a few columns, this makes reads and queries much faster, and compression keeps file sizes and I/O low. Parquet also stores a schema, so types are preserved. It works best for batch processing and large-scale analytics, not for many small, frequent updates to the same file (it favors batch writes over frequent edits). If we take the CSV example above, Parquet would store all names together, all ages together, and all cities together. This columnar layout would look like this:
Names: Kanwal, Qasim
Ages: 30, 25
Cities: New York, Edmonton
It also stores metadata for each column: type, min/max values, null count, and compression information. This enables fast reads, efficient storage, and precise type handling. Compression algorithms like Snappy or gzip further reduce disk space. Its strengths include:
- Compression: Identical column values compress well. Files are small and cheap to store.
- Column-wise reading: Load only the columns you need, speeding up queries.
- Rich typing: The schema is stored, so there is no type inference on every load.
- Scale: Works well for millions or billions of rows.
# 3. Arrow
Arrow is a different kind of thing from CSV or Parquet: it is a columnar format designed for in-memory use, optimized for fast operations. In Hugging Face, every dataset is backed by an Arrow table, whether you started from a CSV, Parquet, or Arrow file. Continuing the same example, Arrow also stores data column by column, but in memory:
Names: contiguous memory block storing Kanwal, Qasim
Ages: contiguous memory block storing 30, 25
Cities: contiguous memory block storing New York, Edmonton
Because the data is in contiguous blocks, operations on a column (such as filtering, mapping, or summarizing) are extremely fast. Arrow also supports memory mapping, which allows datasets to be accessed from disk without having to be fully loaded into RAM. Some of the major advantages of this format are:
- Zero-copy reads: Files can be memory-mapped without loading everything into RAM.
- Fast Column Access: Columnar layout enables vectorized operations.
- Rich types: Handles nested data, lists, and tensors.
- Interoperable: Works with pandas, PyArrow, Spark, Polars, and others.
# Wrapping Up
Hugging Face Datasets makes switching formats routine. Use CSV for quick experiments, Parquet for storing large tables, and Arrow for fast in-memory training. Knowing when to use each keeps your pipeline fast and simple, so you can spend more time on the model.
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for the intersection of AI with data science and medicine. She co-authored the eBook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she is an advocate for diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a strong advocate for change, having founded FEMCodes to empower women in STEM fields.