Reading DataFrame from partitioned parquet file

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass both paths:

```scala
val dataframe = sqlContext.read.parquet(
  "file:///your/path/data=jDD/year=2015/month=10/day=5/",
  "file:///your/path/data=jDD/year=2015/month=10/day=6/")
```

If you have folders under day=X, say country=XX, country will automatically be added as a column in the dataframe.

EDIT: As of Spark 1.6 one needs to …
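For anyone on the Python API, the same multi-path read looks like this (a minimal sketch; the paths are the hypothetical ones from the answer above, and the SparkSession setup assumes Spark 2.x+, where it replaces sqlContext):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pass each partition directory you want; deeper partition folders
# (e.g. country=XX under day=X) are discovered and exposed as columns.
df = spark.read.parquet(
    "file:///your/path/data=jDD/year=2015/month=10/day=5/",
    "file:///your/path/data=jDD/year=2015/month=10/day=6/",
)
df.printSchema()
```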

Spark Dataframe validating column names for parquet writes

For everyone experiencing this in pyspark: it even happened to me after renaming the columns. One way I could get this to work after some iteration is this:

```python
file = "/opt/myfile.parquet"
df = spark.read.parquet(file)

# Strip spaces from every column name, then re-read the file with the
# cleaned schema so the columns pick up the new names.
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))

df = spark.read.schema(df.schema).parquet(file)
```
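If the write still fails on characters other than spaces, a slightly more general variant of the same idea strips every character the parquet writer rejects (a sketch, not the answer's exact code; the src/dst paths are hypothetical, and the character set " ,;{}()\n\t=" is the one Spark's parquet writer reports as invalid in its error message):

```python
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "/opt/myfile.parquet"        # hypothetical input path
dst = "/opt/myfile_clean.parquet"  # hypothetical output path

df = spark.read.parquet(src)

# Drop every character among " ,;{}()\n\t=" from each column name,
# since Spark rejects parquet column names containing any of them.
invalid = re.compile(r"[ ,;{}()\n\t=]")
for c in df.columns:
    df = df.withColumnRenamed(c, invalid.sub("", c))

df.write.mode("overwrite").parquet(dst)
```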