Pandas Doing Parquet

A data science world without CSVs

Contents

  • Why?
  • Ingesting CSV
  • Writing parquet
  • Reading parquet
  • Getting fancy

How is Parquet superior to CSV?

CSV's limitations

  • Uncompressed (big files)
  • The whole file is read before a single column can be accessed (similar to a 'Full Table Scan' in a database)
  • Reading is single-threaded (slow); see the timing sketch after this list
  • Not suited to distributed computing
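
A rough timing sketch for the points above, assuming a reasonably large CSV at ./data/example.csv and a Parquet copy of the same data at ./data/example.parquet (both paths are placeholders): pandas parses the CSV on a single thread, while PyArrow reads the Parquet file with multiple threads by default.

					import time

					import pandas as pd
					from pyarrow import parquet

					# Example paths only; substitute files of your own.
					start = time.perf_counter()
					df_csv = pd.read_csv('./data/example.csv')                # single-threaded parse
					print(f'CSV read:     {time.perf_counter() - start:.2f}s')

					start = time.perf_counter()
					pa_table = parquet.read_table('./data/example.parquet')  # multi-threaded by default
					print(f'Parquet read: {time.perf_counter() - start:.2f}s')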

Why doesn't CSV work for distributed computing (DC)?

DC works by splitting data up and sharing it across many machines

  • CSVs aren't partitioned (not naturally splittable)
  • Column headings aren't native
  • Column datatypes aren't explicit (see the schema sketch after this list)
  • Column headings are mixed in with data
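
To make the datatype point concrete: a Parquet file carries an explicit schema in its metadata, so column names and types can be inspected without parsing any data. A minimal sketch, assuming a file like the efficient.parquet written later in these slides:

					from pyarrow import parquet

					# Column names and types live in the file's metadata,
					# so the schema can be inspected without reading any data.
					schema = parquet.read_schema('efficient.parquet')
					print(schema)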

Parquet's advantages

It's faster!
It's smaller!
It makes you ready for Big Data and Distributed Computing

File size comparison
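
A quick way to run this comparison on your own data, assuming the CSV from the ingest section and the two Parquet files written in 'Writing Parquet Tables' below:

					import os

					# Compare on-disk sizes of the source CSV and the two
					# Parquet files written in the sections below.
					for path in ['./data/bad.csv',
							'efficient.parquet',
							'more_efficient.gzip.parquet']:
						print(f'{path}: {os.path.getsize(path) / 1e6:.1f} MB')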

Ingesting CSVs


					from pyarrow import csv, Table

					# Read a CSV straight into an Arrow Table with PyArrow's own CSV reader...
					pa_table = csv.read_csv('./data/bad.csv')
					# ...or convert an existing pandas DataFrame
					pa_table = Table.from_pandas(pd_dataframe)
					

Writing Parquet Tables


					from pyarrow import parquet

					# By default PyArrow writes Parquet with Snappy compression
					parquet.write_table(pa_table, 'efficient.parquet')

					# To really squeeze the data, use gzip compression
					parquet.write_table(pa_table, 'more_efficient.gzip.parquet',
						compression='gzip')
					

Reading Parquet


					from pyarrow import parquet

					# Reading a Parquet file is multi-threaded by default
					pa_table = parquet.read_table('efficient.parquet')
					# convert back to pandas
					df = pa_table.to_pandas()
					

More on Reading Parquet

Only read the columns you need.

Parquet is columnar, so only the columns you pick are read.
That means less data is accessed, downloaded, and parsed.


					from pyarrow import parquet

					pa_table_ids = parquet.read_table('efficient.parquet',
						columns=['id', 'last_name'])
					
					df = pa_table_ids.to_pandas()
					

Getting fancy

Write datasets partitioned by a primary key or a datetime column.
This speeds up reads that filter on those columns, since whole partitions can be skipped.

					from pyarrow import parquet

					# Rows land in a directory tree like
					# dataset_name/one=<value>/two=<value>/<file>.parquet
					parquet.write_to_dataset(pa_table, root_path='dataset_name',
						partition_cols=['one', 'two'])
					
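
Reading a partitioned dataset back works just like reading a single file. A sketch, assuming the dataset written above; the value 'a' is only a placeholder. The filter is applied to a partition column, so directories that cannot match are skipped entirely.

					from pyarrow import parquet

					# Only partitions where one == 'a' are read;
					# every other directory is skipped entirely.
					pa_table = parquet.read_table('dataset_name',
						filters=[('one', '=', 'a')])
					df = pa_table.to_pandas()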

Caveats

Not everything is perfect
  • You need to spend a little extra effort if you want your Parquet files to be 'Apache Spark ready' (see the sketch after this list)
  • Writing files can take a little longer (as they get compressed)
  • Conversion to/from pandas has some edge cases to be aware of
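
For the Spark caveat, one option is the flavor argument of write_table, which sanitises field names and timestamp types for Spark readers. A minimal sketch; the output filename is just an example:

					from pyarrow import parquet

					# flavor='spark' sanitises field names and timestamp types
					# so that Spark readers accept the file.
					parquet.write_table(pa_table, 'spark_ready.parquet',
						flavor='spark')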

Thank you

This content was heavily inspired by the excellent documentation for the pyArrow project