How to calculate Median in spark sqlContext for column of data type double

For non-integral values you should use the percentile_approx UDF:

    import org.apache.spark.mllib.random.RandomRDDs

    val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
    df.registerTempTable("df")

    sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
    // +--------------------+
    // |                 _c0|
    // +--------------------+
    // |0.035379710486199915|
    // +--------------------+

On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and … Read more
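On newer Spark versions the same median can be computed more directly. A minimal sketch, assuming a SparkSession named spark, Spark 2.0+ (for approxQuantile), and made-up sample values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("median-example").getOrCreate()
    import spark.implicits._

    // Illustrative data only.
    val df = Seq(1.0, 2.5, 3.0, 4.5, 100.0).toDF("x")
    df.createOrReplaceTempView("df")

    // SQL route: percentile_approx(column, percentage).
    spark.sql("SELECT percentile_approx(x, 0.5) AS median FROM df").show()

    // DataFrame route: approxQuantile(column, probabilities, relativeError);
    // a relativeError of 0.0 requests the exact quantile (more expensive).
    val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)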

How to improve broadcast Join speed with between condition in Spark

The trick is to rewrite the join condition so it contains an = component that can be used to optimize the query and narrow down the possible matches. For numeric values you bucketize your data and use the buckets in the join condition. Let's say your data looks like this:

    val a = spark.range(100000)
      .withColumn("cur", (rand(1) * 1000).cast("bigint"))

    val b … Read more
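A minimal sketch of how the bucketing idea can be applied, assuming Spark 2.4+ (for the sequence function) and a SparkSession named spark; the ranges table b, its columns (start, end, label), and the bucket size of 100 are illustrative assumptions, not part of the original answer:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val bucketSize = 100L

    // Fact table: assign each row to exactly one bucket.
    val a = spark.range(100000)
      .withColumn("cur", (rand(1) * 1000).cast("bigint"))
      .withColumn("bucket", ($"cur" / bucketSize).cast("bigint"))

    // Range table (hypothetical): explode every [start, end] interval
    // into all the buckets it overlaps.
    val b = Seq((0L, 149L, "low"), (150L, 1000L, "high"))
      .toDF("start", "end", "label")
      .withColumn("bucket", explode(sequence(
        ($"start" / bucketSize).cast("bigint"),
        ($"end" / bucketSize).cast("bigint"))))

    // The equality on bucket gives the optimizer an equi-join key to work with;
    // the between condition keeps only the true matches inside each bucket.
    val joined = a.join(broadcast(b),
      a("bucket") === b("bucket") && $"cur".between($"start", $"end"))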

When to use Spark DataFrame/Dataset API and when to use plain RDD?

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, under "When to use RDDs?": consider these scenarios or common use cases for using RDDs: when you want low-level transformations, actions, and control over your dataset; when your data is unstructured, such as media streams or streams of text; when you … Read more
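An illustrative contrast (not from the article) of the same aggregation written against the low-level RDD API and the declarative DataFrame API, assuming a SparkSession named spark and made-up sample data:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    val events = Seq(("click", 1), ("view", 3), ("click", 2))

    // RDD: full control over the functions that run and how data is partitioned,
    // but the lambdas are opaque to Spark, so nothing inside them is optimized.
    val rddCounts = spark.sparkContext
      .parallelize(events)
      .reduceByKey(_ + _)
      .collect()

    // DataFrame: a structured, declarative query that Catalyst can optimize
    // (column pruning, predicate pushdown, whole-stage code generation).
    val dfCounts = events.toDF("kind", "n")
      .groupBy("kind")
      .agg(sum("n").as("total"))
      .collect()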

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

That's the line in your physical plan to remember if you want to know the real difference between Dataset[T] and DataFrame (which is Dataset[Row]):

    Filter <function1>.apply

I keep saying that people should stay away from the typed Dataset API and keep using the untyped DataFrame API, because the Scala code becomes a black box to the optimizer … Read more
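A minimal sketch of the difference, assuming a SparkSession named spark; the Parquet path, case class, and column name are hypothetical:

    import spark.implicits._

    case class Record(id: Long, x: Double)

    // Hypothetical input path.
    val ds = spark.read.parquet("/tmp/records.parquet").as[Record]

    // Typed API: the lambda is an opaque Scala function, so the physical plan
    // shows a Filter <function1>.apply step that cannot be pushed down
    // to the data source.
    ds.filter(_.x > 1.0).explain()

    // Untyped API: the Column expression is transparent to Catalyst, so the
    // predicate can be pushed down into the Parquet scan.
    ds.filter($"x" > 1.0).explain()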

How to use groupBy to collect rows into a map?

The following will work with Spark 2.0. You can use the map function, available since the 2.0 release, to get columns as a Map:

    val df1 = df.groupBy(col("school_name"))
      .agg(collect_list(map($"name", $"age")) as "map")
    df1.show(false)

This will give you the output below:

    +-----------+------------------------------------+
    |school_name|map                                 |
    +-----------+------------------------------------+
    |school B   |[Map(cathy -> 10), Map(shaun -> 5)] |
    |school A   |[Map(michael -> 7), Map(emily -> 5)]|
    +-----------+------------------------------------+

… Read more
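A self-contained usage sketch of the same technique, assuming a SparkSession named spark; the sample rows are made up to match the output shown above:

    import org.apache.spark.sql.functions.{col, collect_list, map}
    import spark.implicits._

    val df = Seq(
      ("school A", "michael", 7),
      ("school A", "emily", 5),
      ("school B", "cathy", 10),
      ("school B", "shaun", 5)
    ).toDF("school_name", "name", "age")

    // One single-entry Map per student, collected into a list per school.
    val df1 = df.groupBy(col("school_name"))
      .agg(collect_list(map($"name", $"age")) as "map")

    df1.show(false)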

Get CSV to Spark dataframe

With more recent versions of Spark (2.0 and later) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:

    df = sqlContext.read.csv("/path/to/your.csv")

Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. … Read more
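A Scala equivalent as a minimal sketch, assuming Spark 2.0+ (where DataFrameReader has a .csv() method), a SparkSession named spark, and an illustrative file path:

    // The header and schema-inference options are optional; shown for illustration.
    val df = spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // guess column types from the data
      .csv("/path/to/your.csv")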