Pandas Doing Parquet

A data science world without CSVs

Contents

  • Why?
  • Ingesting CSV
  • Writing parquet
  • Reading parquet
  • Getting fancy

How is Parquet superior to CSV?

CSV's limitations

  • Uncompressed (big files)
  • The whole file is read before a single column can be accessed (similar to a 'Full Table Scan' in a database)
  • Reading is single-threaded (slow); see the timing sketch after this list
  • Not suited to distributed computing
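
A rough timing sketch for the points above, assuming a reasonably large CSV at ./data/example.csv and a Parquet copy of the same data at ./data/example.parquet (both paths are placeholders): pandas parses the CSV on a single thread, while PyArrow reads the Parquet file with multiple threads by default.

					import time

					import pandas as pd
					from pyarrow import parquet

					# Example paths only; substitute files of your own.
					start = time.perf_counter()
					df_csv = pd.read_csv('./data/example.csv')                # single-threaded parse
					print(f'CSV read:     {time.perf_counter() - start:.2f}s')

					start = time.perf_counter()
					pa_table = parquet.read_table('./data/example.parquet')  # multi-threaded by default
					print(f'Parquet read: {time.perf_counter() - start:.2f}s')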

Why doesn't CSV work for distributed computing (DC)?

DC works by splitting data up and sharing it across many machines

  • CSVs aren't partitioned (not naturally splittable)
  • Column headings aren't native
  • Column datatypes aren't explicit (see the schema sketch after this list)
  • Column headings are mixed in with data
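
To make the datatype point concrete: a Parquet file carries an explicit schema in its metadata, so column names and types can be inspected without parsing any data. A minimal sketch, assuming a file like the efficient.parquet written later in these slides:

					from pyarrow import parquet

					# Column names and types live in the file's metadata,
					# so the schema can be inspected without reading any data.
					schema = parquet.read_schema('efficient.parquet')
					print(schema)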

Parquet's advantages

It's faster!
It's smaller!
It makes you ready for Big Data and Distributed Computing

File size comparison
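
A quick way to run this comparison on your own data, assuming the CSV from the ingest section and the two Parquet files written in 'Writing Parquet Tables' below:

					import os

					# Compare on-disk sizes of the source CSV and the two
					# Parquet files written in the sections below.
					for path in ['./data/bad.csv',
							'efficient.parquet',
							'more_efficient.gzip.parquet']:
						print(f'{path}: {os.path.getsize(path) / 1e6:.1f} MB')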

Ingesting CSVs


					from pyarrow import csv, Table

					# Read a CSV straight into an Arrow Table with PyArrow's own CSV reader...
					pa_table = csv.read_csv('./data/bad.csv')
					# ...or convert an existing pandas DataFrame
					pa_table = Table.from_pandas(pd_dataframe)
					

Writing Parquet Tables


					from pyarrow import parquet

					# By default PyArrow writes Parquet with Snappy compression
					parquet.write_table(pa_table, 'efficient.parquet')

					# To really squeeze the data, use gzip compression
					parquet.write_table(pa_table, 'more_efficient.gzip.parquet',
						compression='gzip')
					

Reading Parquet


					from pyarrow import parquet

					# Reading a Parquet file is multi-threaded by default
					pa_table = parquet.read_table('efficient.parquet')
					# convert back to pandas
					df = pa_table.to_pandas()
					

More on Reading Parquet

Only read the columns you need.

Parquet is columnar, so only the columns you pick are read.
That means less data is accessed, downloaded, and parsed.


					from pyarrow import parquet

					pa_table_ids = parquet.read_table('efficient.parquet',
						columns=['id', 'last_name'])
					
					df = pa_table_ids.to_pandas()
					

Getting fancy

Write datasets partitioned by a primary key or a datetime column.
This speeds up reads that filter on those columns, since whole partitions can be skipped.

					from pyarrow import parquet

					# Rows land in a directory tree like
					# dataset_name/one=<value>/two=<value>/<file>.parquet
					parquet.write_to_dataset(pa_table, root_path='dataset_name',
						partition_cols=['one', 'two'])
					
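
Reading a partitioned dataset back works just like reading a single file. A sketch, assuming the dataset written above; the value 'a' is only a placeholder. The filter is applied to a partition column, so directories that cannot match are skipped entirely.

					from pyarrow import parquet

					# Only partitions where one == 'a' are read;
					# every other directory is skipped entirely.
					pa_table = parquet.read_table('dataset_name',
						filters=[('one', '=', 'a')])
					df = pa_table.to_pandas()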

Caveats

Not everything is perfect
  • You need to spend a little extra effort if you want your Parquet files to be 'Apache Spark ready' (see the sketch after this list)
  • Writing files can take a little longer (as they get compressed)
  • Conversion to/from pandas has some edge cases to be aware of
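
For the Spark caveat, one option is the flavor argument of write_table, which sanitises field names and timestamp types for Spark readers. A minimal sketch; the output filename is just an example:

					from pyarrow import parquet

					# flavor='spark' sanitises field names and timestamp types
					# so that Spark readers accept the file.
					parquet.write_table(pa_table, 'spark_ready.parquet',
						flavor='spark')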

Thank you

This content was heavily inspired by the excellent documentation for the pyArrow project