How to make JSON flattening memory efficient?

Do not collect this data. Collecting pulls every row into the driver, and it will likely never fit in memory there.

Instead, write it directly to a file. Spark writes each partition from the executors, so the data never has to pass through the driver.

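# In PySpark, toDF takes the column names as a single list.
collected_data = zip_rdd.map(extract_files).toDF(["column", "names", "go", "here"])
# Each partition is written by the executors; nothing is pulled to the driver.
collected_data.write.parquet("/path/to/folder")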
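
For a fuller picture, here is a minimal sketch of what the extraction step might look like if it is written as a generator. Everything in it is an assumption rather than the asker's actual code: it supposes `zip_rdd` comes from `sc.binaryFiles`, that every archive member is a JSON object, and the names `flatten`, `extract_rows`, `run` and the four column names are made up for illustration. Yielding rows one at a time and using `flatMap` means an executor only builds one row at a time on top of the archive bytes it already holds.

```python
import io
import json
import zipfile


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def extract_rows(path_and_bytes):
    """Yield one (archive, member, key, value) tuple per flattened field.

    Hypothetical stand-in for extract_files. Assumes the input is a
    (path, bytes) pair as produced by SparkContext.binaryFiles.
    """
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        for member in archive.namelist():
            with archive.open(member) as handle:
                record = json.load(handle)  # assumes a JSON object per member
            for key, value in flatten(record).items():
                # Store values as JSON text so the column has a single type.
                yield (path, member, key, json.dumps(value))


def run(spark, input_glob, output_dir):
    """Read zip archives, flatten them on the executors, write Parquet."""
    zip_rdd = spark.sparkContext.binaryFiles(input_glob)
    df = zip_rdd.flatMap(extract_rows).toDF(["archive", "member", "key", "value"])
    df.write.parquet(output_dir)  # no collect(): the driver never sees the rows
```

With an existing `SparkSession` named `spark`, this would be called as `run(spark, "/path/to/zips/*.zip", "/path/to/folder")`, where both paths are placeholders.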
