Provide a schema while reading a CSV file as a DataFrame in Scala Spark

Try the code below; you need not specify the schema. When you set inferSchema to true, Spark infers it from your CSV file.

    val pagecount = sqlContext.read.format("csv")
      .option("delimiter", " ")
      .option("quote", "")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to specify the schema manually, you can do it as below:

    import org.apache.spark.sql.types._
    val customSchema = StructType(Array( … Read more
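The manual-schema snippet above is truncated, so here is a minimal sketch of what a complete definition might look like. It assumes the pagecounts file has four space-delimited columns; the column names and types used here (project, article, requests, bytes_served) are illustrative assumptions, not taken from the original answer.

    // Sketch of a manual schema; column names and types are assumed, not authoritative.
    import org.apache.spark.sql.types._

    val customSchema = StructType(Array(
      StructField("project", StringType, nullable = true),
      StructField("article", StringType, nullable = true),
      StructField("requests", IntegerType, nullable = true),
      StructField("bytes_served", LongType, nullable = true)
    ))

    // Passing an explicit schema skips the extra pass over the data that inferSchema needs.
    val pagecount = sqlContext.read.format("csv")
      .option("delimiter", " ")
      .option("header", "true")
      .schema(customSchema)
      .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")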

Write a single CSV file using spark-csv

It creates a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder), you can repartition (preferred if the upstream data is large, but it requires a shuffle):

    df.repartition(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("mydata.csv")

or coalesce:

    df.coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("mydata.csv")

data frame before … Read more
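Both variants still leave you with a folder named mydata.csv containing a single part-xxxxx file. If you need a true single file, a common follow-up is to move that part file out with Hadoop's FileSystem API. A minimal sketch, assuming the job above ran in a shell where sc (the SparkContext) is available and the output folder holds exactly one part file; the paths are illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val outDir = new Path("mydata.csv")

    // Locate the lone part file that coalesce(1)/repartition(1) produced.
    val partFile = fs.globStatus(new Path(outDir, "part-*"))(0).getPath

    // Move it to a real single-file name, then remove the now-empty folder.
    fs.rename(partFile, new Path("mydata-single.csv"))
    fs.delete(outDir, true)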