Get CSV to Spark dataframe

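With more recent versions of Spark (2.0 and later) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method. (DataFrameReader itself dates back to 1.4, but the built-in .csv() method only arrived in 2.0; on 1.4 to 1.6 you needed the external spark-csv package instead.)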

df = sqlContext.read.csv("/path/to/your.csv")

Note that you can also indicate that the CSV file has a header row by adding the keyword argument header=True to the .csv() call. A handful of other options (separator, schema inference, and so on) are available and are described in the DataFrameReader.csv() documentation.
