Do not collect this data. Collecting pulls all of it into the driver, where it will likely never fit in memory.
You can save it to a file directly instead:
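# PySpark: RDD.toDF takes the column names as a single list, not as separate arguments
collected_data = zip_rdd.map(extract_files).toDF(["column", "names", "go", "here"])
# Each executor writes its own partitions, so nothing is gathered on the driver
collected_data.write.parquet("/path/to/folder")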
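
The answer does not show what extract_files looks like, so here is a minimal sketch of one possible shape. It assumes zip_rdd was built with sc.binaryFiles, so that each element is a (path, bytes) pair, and that one row per archive is wanted. The four fields it returns (path, member count, total uncompressed size, member names) are hypothetical and only stand in for the placeholder column names above.

```python
import io
import zipfile


def extract_files(record):
    """Turn one (path, zip_bytes) pair into a single 4-field row.

    Hypothetical stand-in for the function used in the answer; the
    choice of fields is an assumption made for illustration.
    """
    path, payload = record
    # Open the archive from the in-memory bytes; no temp file is needed.
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        members = archive.infolist()
        names = [m.filename for m in members]
        total_size = sum(m.file_size for m in members)
    return (path, len(names), total_size, ",".join(names))


# Local check without Spark: build a small zip in memory and run the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "hello")
    archive.writestr("b.txt", "spark")

row = extract_files(("example.zip", buffer.getvalue()))
print(row)
```

Because the function returns a plain tuple per input record, it can be passed to zip_rdd.map as in the snippet above, and the rows stay on the executors until they are written out. If one row per file inside each archive is needed instead, the function would have to return a list of tuples and be used with flatMap rather than map.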