How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Use the s3fs module, as proposed by yjk21. Note, however, that calling ParquetDataset returns a pyarrow.parquet.ParquetDataset object, not a DataFrame. To get a pandas DataFrame, call .read_pandas().to_pandas() on it:

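import pyarrow.parquet as pq
import s3fs

# Credentials are picked up from the usual AWS configuration
s3 = s3fs.S3FileSystem()

# read_pandas() reads the dataset into a pyarrow Table; to_pandas() converts it to a DataFrame
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()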
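Since the question asks about a list of files rather than a whole bucket, here is a minimal sketch of the same approach applied to explicit paths. ParquetDataset accepts a list of paths as its first argument. The helper names normalize_s3_paths and read_parquet_list, and the bucket and key names in the usage comment, are made up for illustration. Stripping the "s3://" scheme is a conservative choice on my part, not something the original answer requires.

```python
def normalize_s3_paths(paths):
    """Strip an optional 's3://' scheme so each entry reads 'bucket/key'."""
    prefix = "s3://"
    return [p[len(prefix):] if p.startswith(prefix) else p for p in paths]


def read_parquet_list(paths, filesystem=None):
    """Read several Parquet files into a single pandas DataFrame.

    Hypothetical helper: it wraps the same ParquetDataset call as above,
    passing a list of paths instead of a single bucket path.
    """
    # Imported here so the path helper above works without these packages.
    import pyarrow.parquet as pq
    import s3fs

    fs = filesystem if filesystem is not None else s3fs.S3FileSystem()
    dataset = pq.ParquetDataset(normalize_s3_paths(paths), filesystem=fs)
    return dataset.read_pandas().to_pandas()


# Usage (placeholder bucket and keys, replace with your own):
# df = read_parquet_list([
#     "s3://your-bucket/part-0.parquet",
#     "s3://your-bucket/part-1.parquet",
# ])
```

The files in the list should share a compatible schema, because they are read as one dataset.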
