How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

You should use the s3fs module, as proposed by yjk21. However, calling ParquetDataset returns a pyarrow.parquet.ParquetDataset object. To get a pandas DataFrame, apply .read_pandas().to_pandas() to it:

    import pyarrow.parquet as pq
    import s3fs

    s3 = s3fs.S3FileSystem()
    pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()

Retrieving subfolder names in an S3 bucket with boto3

The code below returns ONLY the 'subfolders' directly under a 'folder' in an S3 bucket:

    import boto3

    bucket = 'my-bucket'
    # Make sure the prefix ends with /
    prefix = 'prefix-name-with-slash/'

    client = boto3.client('s3')
    result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
    # CommonPrefixes is missing from the response when there are no subfolders
    for o in result.get('CommonPrefixes', []):
        print('sub folder :', o.get('Prefix'))

For more details, you can refer to https://github.com/boto/boto3/issues/134
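A single listing call returns at most 1000 entries, so a prefix with many subfolders needs pagination. The following is a minimal sketch, assuming the `list_objects_v2` paginator is acceptable in place of `list_objects`; the function name `list_subfolders` is hypothetical, and `client` is expected to behave like `boto3.client('s3')`.

```python
def list_subfolders(client, bucket, prefix=""):
    """Return the 'subfolder' prefixes directly under `prefix`.

    `client` is expected to behave like boto3.client("s3").
    """
    # Normalise the prefix so it ends with "/" (an empty prefix means the bucket root).
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    paginator = client.get_paginator("list_objects_v2")
    subfolders = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # CommonPrefixes is absent from a page that has no subfolders.
        for entry in page.get("CommonPrefixes", []):
            subfolders.append(entry["Prefix"])
    return subfolders
```

It could be called as `list_subfolders(boto3.client('s3'), 'my-bucket', 'prefix-name-with-slash/')`. Passing the client in keeps the function easy to exercise with a stub.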

Boto3 Error: botocore.exceptions.NoCredentialsError: Unable to locate credentials

Try specifying the keys manually:

    import boto3

    s3 = boto3.resource('s3', aws_access_key_id=ACCESS_ID, aws_secret_access_key=ACCESS_KEY)

For security reasons, make sure you don't hard-code your ACCESS_ID and ACCESS_KEY in the source. Consider using environment configs and injecting them into the code, as suggested by @Tiger_Mike. For Prod environments, consider using rotating access keys: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_RotateAccessKey
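One way to keep the keys out of the source is to read them from the environment. This is a minimal sketch, assuming the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optional `AWS_SESSION_TOKEN` variables; the helper name `load_aws_credentials` is hypothetical.

```python
import os


def load_aws_credentials(env=os.environ):
    """Build keyword arguments for boto3 from environment variables."""
    try:
        creds = {
            "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
            "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        }
    except KeyError as exc:
        # Fail early with the name of the variable that is not set.
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from exc
    # A session token is only present for temporary credentials.
    if env.get("AWS_SESSION_TOKEN"):
        creds["aws_session_token"] = env["AWS_SESSION_TOKEN"]
    return creds
```

The result can be passed on as `boto3.resource('s3', **load_aws_credentials())`. boto3 also reads these same variables on its own, so when they are set, a plain `boto3.resource('s3')` should work too; the explicit helper mainly helps when you want a clear error message or use differently named variables.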

Check if a key exists in a bucket in S3 using boto3

Boto 2's boto.s3.key.Key object used to have an exists method that checked whether the key existed on S3 by doing a HEAD request and looking at the result, but it seems that this no longer exists. You have to do it yourself:

    import boto3
    import botocore

    s3 = boto3.resource('s3')
    try:
        s3.Object('my-bucket', 'dootdoot.jpg').load()
    except botocore.exceptions.ClientError …
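The HEAD-request check described here can be sketched as follows. This is one common way to write it, not necessarily how the truncated answer continues. The function name `key_exists` is hypothetical, `client` is expected to behave like `boto3.client('s3')`, and the sketch assumes a missing key surfaces as a `ClientError` with error code "404".

```python
def key_exists(client, bucket, key):
    """Return True if `key` exists in `bucket`, False if S3 answers 404."""
    try:
        # head_object issues a HEAD request, so the object body is not downloaded.
        client.head_object(Bucket=bucket, Key=key)
    except client.exceptions.ClientError as exc:
        if exc.response["Error"]["Code"] == "404":
            return False
        # Anything else (for example 403 Forbidden) is a real problem, so re-raise.
        raise
    return True
```

For example, `key_exists(boto3.client('s3'), 'my-bucket', 'dootdoot.jpg')`. Only the 404 case is treated as "does not exist"; swallowing every error would hide permission problems.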

Read file content from S3 bucket with boto3

boto3 offers a resource model that makes tasks like iterating through objects easier. Unfortunately, StreamingBody doesn't provide readline or readlines.

    import boto3

    s3 = boto3.resource('s3')
    bucket = s3.Bucket('test-bucket')
    # Iterates through all the objects, doing the pagination for you. Each obj
    # is an ObjectSummary, so it doesn't contain the body. You'll need to call
    # get …
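Because each ObjectSummary lacks the body, reading the contents takes an extra `get()` call per object. A minimal sketch, assuming the objects are text small enough to load into memory; the function name `read_all_objects` and the default encoding are assumptions, and `bucket` is expected to behave like `boto3.resource('s3').Bucket(...)`.

```python
def read_all_objects(bucket, encoding="utf-8"):
    """Yield (key, text) for every object in a boto3 Bucket resource."""
    for summary in bucket.objects.all():
        # ObjectSummary.get() performs the GetObject call; "Body" is a StreamingBody.
        body = summary.get()["Body"]
        try:
            # read() pulls the whole object into memory before decoding.
            yield summary.key, body.read().decode(encoding)
        finally:
            body.close()
```

Once the text is in hand, `text.splitlines()` gives line-by-line access, for example `for key, text in read_all_objects(boto3.resource('s3').Bucket('test-bucket')): ...`. For very large objects, reading in chunks would be a better fit than a single `read()`.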