Inspect Parquet from command line

You can use parquet-tools with the cat command and the --json option to view the files in JSON format without making a local copy. Here is an example:

    parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet

This prints out the data in JSON format:

    {"name":"gil","age":48,"city":"london"}
    {"name":"jane","age":30,"city":"new york"}
    {"name":"jordan","age":18,"city":"toronto"}

Disclaimer: this was tested on Cloudera CDH 5.12.0.
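If parquet-tools is not available, a similar quick inspection can be done with pyarrow. This is a minimal sketch, assuming a recent pyarrow and a local copy of the file; the file name is a placeholder:

    import pyarrow.parquet as pq

    # Placeholder path -- point this at a local copy of the Parquet file.
    pf = pq.ParquetFile("part-r-00000.gz.parquet")

    # Schema and row-group metadata, roughly what parquet-tools schema/meta shows.
    print(pf.schema_arrow)
    print(pf.metadata)

    # Print the rows as dicts, similar to parquet-tools cat --json.
    for row in pf.read().to_pylist():
        print(row)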

How to convert a csv file to parquet

I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!

Install dependencies using pip:

    pip install pandas pyarrow

or using conda:

    conda install pandas pyarrow -c conda-forge

Convert CSV to Parquet in chunks:

    # csv_to_parquet.py
    import … Read more
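As a rough idea of what such a script does, here is a minimal, non-chunked sketch using pandas with the pyarrow engine; the file names are placeholders, and the chunked version in the full answer is what you want for files that don't fit in memory:

    # csv_to_parquet_sketch.py -- whole-file conversion (placeholder file names)
    import pandas as pd

    # Read the CSV into a DataFrame, then write it back out as Parquet.
    df = pd.read_csv("input.csv")
    df.to_parquet("output.parquet", engine="pyarrow")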

How to read partitioned parquet files from S3 using pyarrow in python

I managed to get this working with the latest release of fastparquet & s3fs. Below is the code for the same:

    import s3fs
    import fastparquet as fp

    s3 = s3fs.S3FileSystem()
    fs = s3fs.core.S3FileSystem()

    # mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
    s3_path = "mybucket/data_folder/*/*/*.parquet"
    all_paths_from_s3 = fs.glob(path=s3_path)

    myopen = s3.open
    # use s3fs as the filesystem
    fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=myopen)
    # convert to pandas dataframe … Read more
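Since the question asks about pyarrow specifically, a partitioned dataset on S3 can also be read with pyarrow plus s3fs. This is a minimal sketch, assuming a pyarrow version that accepts an s3fs filesystem object and the same bucket layout as above (the path is a placeholder):

    import s3fs
    import pyarrow.parquet as pq

    s3 = s3fs.S3FileSystem()

    # Point pyarrow at the root of the partitioned dataset; the partition columns
    # (serial_number, cur_date) are discovered from the directory names.
    dataset = pq.ParquetDataset("mybucket/data_folder", filesystem=s3)
    df = dataset.read().to_pandas()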

How to read a Parquet file into Pandas DataFrame?

pandas 0.21 introduces new functions for Parquet:

    import pandas as pd
    pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

    import pandas as pd
    pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains: these engines are very similar and should read/write nearly identical parquet format files. The libraries differ by having different underlying dependencies (fastparquet uses Numba, while pyarrow uses a C library).
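As a quick check that both engines are installed and working, a round trip might look like the following sketch (the file name and data are placeholders):

    import pandas as pd

    # Placeholder DataFrame written once, read back with each engine.
    df = pd.DataFrame({"name": ["gil", "jane"], "age": [48, 30]})
    df.to_parquet("example.parquet", engine="pyarrow")

    print(pd.read_parquet("example.parquet", engine="pyarrow"))
    print(pd.read_parquet("example.parquet", engine="fastparquet"))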

Reading parquet files from multiple directories in Pyspark

A little late, but I found this while I was searching and it may help someone else… You might also try unpacking the argument list to spark.read.parquet():

    paths = ['foo', 'bar']
    df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs into the path argument:

    basePath = 's3://bucket/'
    paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
             's3://bucket/partition_value1=*/partition_value2=2017-05-*']
    df = spark.read.option('basePath', basePath).parquet(*paths)

This is cool because you don't … Read more
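For a self-contained illustration of the unpacking pattern, here is a minimal sketch using local paths instead of S3 (the directory names are placeholders chosen for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-dir-read").getOrCreate()

    # Write two small DataFrames to separate directories (placeholder paths).
    spark.range(5).write.mode("overwrite").parquet("/tmp/parquet_demo/foo")
    spark.range(5, 10).write.mode("overwrite").parquet("/tmp/parquet_demo/bar")

    # Unpack the list of directories into spark.read.parquet().
    paths = ["/tmp/parquet_demo/foo", "/tmp/parquet_demo/bar"]
    df = spark.read.parquet(*paths)
    df.show()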

Why are Spark Parquet files for an aggregate larger than the original?

In general, columnar storage formats like Parquet are highly sensitive to data distribution (data organization) and the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage. Aggregation, like the one you apply, has to shuffle the data. When you check the execution … Read more
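To see the effect of data organization on file size in isolation, here is a small sketch (using pandas and pyarrow rather than Spark, purely for illustration) that writes the same low-cardinality column twice, once well organized and once shuffled, and compares the resulting Parquet file sizes; the exact numbers will vary, but the shuffled file is typically noticeably larger:

    import os
    import numpy as np
    import pandas as pd

    # Same values, different organization: grouped runs vs. a random shuffle.
    values = np.repeat(np.arange(1000), 1000)      # low cardinality, well organized
    shuffled = np.random.permutation(values)       # same values, no organization

    pd.DataFrame({"x": values}).to_parquet("sorted.parquet")
    pd.DataFrame({"x": shuffled}).to_parquet("shuffled.parquet")

    print("sorted:  ", os.path.getsize("sorted.parquet"), "bytes")
    print("shuffled:", os.path.getsize("shuffled.parquet"), "bytes")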

Is it better to have one large parquet file or lots of smaller parquet files?

Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default) because snappy-compressed Parquet files are splittable (2). Using snappy instead of gzip will significantly increase the file size, so if storage space is an issue, that needs to be considered. .option("compression", "gzip") is the option to override … Read more
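A minimal sketch of how both knobs are typically set when writing from Spark; the repartition count, data, and output path are placeholders chosen for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sizing").getOrCreate()
    df = spark.range(1_000_000)   # placeholder data

    # Control the number (and hence rough size) of output files via repartition,
    # and override the default snappy codec with gzip if storage is the priority.
    (df.repartition(4)
       .write
       .mode("overwrite")
       .option("compression", "gzip")
       .parquet("/tmp/output_parquet"))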