What are the various join types in Spark?

Here is a simple illustrative experiment:

  import org.apache.spark.sql._

  object SparkSandbox extends App {
    implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("ERROR")

    val left = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value")
    val right = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value")

    println("LEFT")
    left.orderBy("id").show()
    println("RIGHT")
    right.orderBy("id").show()

    val joinTypes = … Read more
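The excerpt is cut off at the join-type list. As a sketch of how the experiment could continue (the contents of joinTypes and the loop below are my assumption, not recovered from the truncated source), Spark accepts the following join-type strings, and looping over them shows each join's behavior on the two DataFrames defined above:

    // Assumed continuation: try every join type on the left/right DataFrames.
    val joinTypes = Seq("inner", "outer", "full", "full_outer",
      "left", "left_outer", "right", "right_outer",
      "left_semi", "left_anti")

    joinTypes.foreach { joinType =>
      println(s"${joinType.toUpperCase} JOIN")
      left.join(right, Seq("id"), joinType).orderBy("id").show()
    }

For example, "left_semi" keeps only the left-side rows that have a match on the right (ids 3 and 4), while "left_anti" keeps only those without one (ids 1 and 2).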

Reading csv files with quoted fields containing embedded commas

I noticed that your problematic line escapes double quotes by doubling them: "32 XIY ""W"" JK, RE LK", which should be interpreted simply as 32 XIY "W" JK, RE LK. As described in RFC-4180, page 2 – If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped … Read more
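For completeness, a minimal sketch of how this maps onto Spark's CSV reader options (the file path and the header setting are placeholder assumptions): by default Spark expects backslash escaping, so the escape character has to be set to the double quote itself to handle RFC-4180 style input.

  // Sketch, assuming an existing SparkSession named spark.
  val df = spark.read
    .option("header", "true")    // assumption: the file has a header row
    .option("quote", "\"")       // fields are enclosed in double quotes
    .option("escape", "\"")      // "" inside a field escapes a literal "
    .csv("path/to/file.csv")     // placeholder path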

Spark parquet partitioning : Large number of files

First, I would really avoid using coalesce, as it is often pushed further up the chain of transformations and may destroy the parallelism of your job (I asked about this issue here: Coalesce reduces parallelism of entire stage (spark)). Writing 1 file per parquet partition is relatively easy (see Spark dataframe write method writing … Read more
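A minimal sketch of the one-file-per-partition write, assuming a partition column named key and a placeholder output path: repartitioning by the same column used in partitionBy routes all rows for a given key to a single task, so each partition directory ends up with exactly one file.

  df.repartition($"key")                  // all rows for a key land in one task
    .write
    .partitionBy("key")                   // one directory per key value
    .parquet("/placeholder/output/path")  // placeholder path

Unlike coalesce, repartition triggers a full shuffle, so it does not collapse the parallelism of the upstream stages.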

Spark 2.0 Dataset vs DataFrame

The difference between df.select("foo") and df.select($"foo") is the signature: the former takes at least one String, the latter zero or more Columns. There is no practical difference beyond that. myDataSet.map(foo.someVal) type-checks, but because any such Dataset operation uses an RDD of objects, it carries a significant overhead compared to DataFrame operations. Let's take a look … Read more
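To make the distinction concrete, a small sketch (the Foo case class and the sample data are illustrative assumptions, not from the original answer):

  import org.apache.spark.sql.SparkSession

  case class Foo(someVal: Int)  // hypothetical record type

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val ds = Seq(Foo(1), Foo(2)).toDS()

  ds.select("someVal")   // String-based; resolved against the schema at runtime
  ds.select($"someVal")  // Column-based; same plan and result as the line above
  ds.map(_.someVal)      // statically typed, but deserializes every row to a Foo

The two select calls produce identical plans; only the map pays the object (de)serialization cost.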