pyarrow - w3toppers.com

How to read partitioned parquet files from S3 using pyarrow in python

I managed to get this working with the latest release of fastparquet & s3fs. Below is the code for the same: import s3fs import fastparquet as fp s3 = s3fs.S3FileSystem() fs = s3fs.core.S3FileSystem() #mybucket/data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet s3_path = “mybucket/data_folder/*/*/*.parquet” all_paths_from_s3 = fs.glob(path=s3_path) myopen = s3.open #use s3fs as the filesystem fp_obj = fp.ParquetFile(all_paths_from_s3,open_with=myopen) #convert to pandas dataframe … Read more

AWS EMR – ModuleNotFoundError: No module named ‘pyarrow’

What are the differences between feather and parquet?

Parquet format is designed for long-term storage, where Arrow is more intended for short term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then) Parquet is more expensive to write than Feather as it features more layers of encoding and … Read more

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

You should use the s3fs module as proposed by yjk21. However as result of calling ParquetDataset you’ll get a pyarrow.parquet.ParquetDataset object. To get the Pandas DataFrame you’ll rather want to apply .read_pandas().to_pandas() to it: import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() pandas_dataframe = pq.ParquetDataset(‘s3://your-bucket/’, filesystem=s3).read_pandas().to_pandas()