How to calculate Median in spark sqlContext for column of data type double

For non-integral values you should use the percentile_approx UDF:

    import org.apache.spark.mllib.random.RandomRDDs

    val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
    df.registerTempTable("df")

    sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show
    // +--------------------+
    // |                 _c0|
    // +--------------------+
    // |0.035379710486199915|
    // +--------------------+

On a side note, you should use GROUP BY, not PARTITION BY. The latter is used for window functions and … Read more
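On newer Spark versions the same median can be computed more directly. A minimal sketch, assuming a SparkSession named spark, Spark 2.0+ (for approxQuantile), and made-up sample values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("median-example").getOrCreate()
    import spark.implicits._

    // Illustrative data only.
    val df = Seq(1.0, 2.5, 3.0, 4.5, 100.0).toDF("x")
    df.createOrReplaceTempView("df")

    // SQL route: percentile_approx(column, percentage).
    spark.sql("SELECT percentile_approx(x, 0.5) AS median FROM df").show()

    // DataFrame route: approxQuantile(column, probabilities, relativeError);
    // a relativeError of 0.0 requests the exact quantile (more expensive).
    val Array(median) = df.stat.approxQuantile("x", Array(0.5), 0.0)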

How to improve broadcast Join speed with between condition in Spark

The trick is to rewrite the join condition so it contains an = component that can be used to optimize the query and narrow down the possible matches. For numeric values you bucketize your data and use the buckets in the join condition. Let's say your data looks like this:

    val a = spark.range(100000)
      .withColumn("cur", (rand(1) * 1000).cast("bigint"))

    val b … Read more
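A minimal sketch of how the bucketing idea can be applied, assuming Spark 2.4+ (for the sequence function) and a SparkSession named spark; the ranges table b, its columns (start, end, label), and the bucket size of 100 are illustrative assumptions, not part of the original answer:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val bucketSize = 100L

    // Fact table: assign each row to exactly one bucket.
    val a = spark.range(100000)
      .withColumn("cur", (rand(1) * 1000).cast("bigint"))
      .withColumn("bucket", ($"cur" / bucketSize).cast("bigint"))

    // Range table (hypothetical): explode every [start, end] interval
    // into all the buckets it overlaps.
    val b = Seq((0L, 149L, "low"), (150L, 1000L, "high"))
      .toDF("start", "end", "label")
      .withColumn("bucket", explode(sequence(
        ($"start" / bucketSize).cast("bigint"),
        ($"end" / bucketSize).cast("bigint"))))

    // The equality on bucket gives the optimizer an equi-join key to work with;
    // the between condition keeps only the true matches inside each bucket.
    val joined = a.join(broadcast(b),
      a("bucket") === b("bucket") && $"cur".between($"start", $"end"))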

When to use Spark DataFrame/Dataset API and when to use plain RDD?

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets, under "When to use RDDs?": consider these scenarios or common use cases for using RDDs: when you want low-level transformations, actions, and control over your dataset; when your data is unstructured, such as media streams or streams of text; when you … Read more
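An illustrative contrast (not from the article) of the same aggregation written against the low-level RDD API and the declarative DataFrame API, assuming a SparkSession named spark and made-up sample data:

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    val events = Seq(("click", 1), ("view", 3), ("click", 2))

    // RDD: full control over the functions that run and how data is partitioned,
    // but the lambdas are opaque to Spark, so nothing inside them is optimized.
    val rddCounts = spark.sparkContext
      .parallelize(events)
      .reduceByKey(_ + _)
      .collect()

    // DataFrame: a structured, declarative query that Catalyst can optimize
    // (column pruning, predicate pushdown, whole-stage code generation).
    val dfCounts = events.toDF("kind", "n")
      .groupBy("kind")
      .agg(sum("n").as("total"))
      .collect()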

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

That's the line in your physical plan to remember if you want to know the real difference between Dataset[T] and DataFrame (which is Dataset[Row]):

    Filter <function1>.apply

I keep saying that people should stay away from the typed Dataset API and keep using the untyped DataFrame API, because the Scala code becomes a black box to the optimizer … Read more
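A minimal sketch of the difference, assuming a SparkSession named spark; the Parquet path, case class, and column name are hypothetical:

    import spark.implicits._

    case class Record(id: Long, x: Double)

    // Hypothetical input path.
    val ds = spark.read.parquet("/tmp/records.parquet").as[Record]

    // Typed API: the lambda is an opaque Scala function, so the physical plan
    // shows a Filter <function1>.apply step that cannot be pushed down
    // to the data source.
    ds.filter(_.x > 1.0).explain()

    // Untyped API: the Column expression is transparent to Catalyst, so the
    // predicate can be pushed down into the Parquet scan.
    ds.filter($"x" > 1.0).explain()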

How to use groupBy to collect rows into a map?

The following will work with Spark 2.0. You can use the map function, available since the 2.0 release, to get columns as a Map:

    val df1 = df.groupBy(col("school_name"))
      .agg(collect_list(map($"name", $"age")) as "map")
    df1.show(false)

This will give you the output below:

    +-----------+------------------------------------+
    |school_name|map                                 |
    +-----------+------------------------------------+
    |school B   |[Map(cathy -> 10), Map(shaun -> 5)] |
    |school A   |[Map(michael -> 7), Map(emily -> 5)]|
    +-----------+------------------------------------+

… Read more
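A self-contained usage sketch of the same technique, assuming a SparkSession named spark; the sample rows are made up to match the output shown above:

    import org.apache.spark.sql.functions.{col, collect_list, map}
    import spark.implicits._

    val df = Seq(
      ("school A", "michael", 7),
      ("school A", "emily", 5),
      ("school B", "cathy", 10),
      ("school B", "shaun", 5)
    ).toDF("school_name", "name", "age")

    // One single-entry Map per student, collected into a list per school.
    val df1 = df.groupBy(col("school_name"))
      .agg(collect_list(map($"name", $"age")) as "map")

    df1.show(false)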

Get CSV to Spark dataframe

With more recent versions of Spark (2.0 and later) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:

    df = sqlContext.read.csv("/path/to/your.csv")

Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. … Read more
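A Scala equivalent as a minimal sketch, assuming Spark 2.0+ (where DataFrameReader has a .csv() method), a SparkSession named spark, and an illustrative file path:

    // The header and schema-inference options are optional; shown for illustration.
    val df = spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // guess column types from the data
      .csv("/path/to/your.csv")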