What does "Correlated scalar subqueries must be Aggregated" mean?

You have to make sure that your subquery, by definition (and not by data), returns only a single row. Otherwise the Spark analyzer rejects the SQL statement during analysis, before any data is read.

In other words, this exception is thrown whenever Catalyst cannot be 100% sure, from the SQL statement alone (without looking at your data), that the subquery returns only a single row.

If you are sure that your subquery returns only a single row, you can wrap the selected column in one of the following standard aggregate functions to keep the Spark analyzer happy:

first
avg
max
min
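
As a rough illustration of the fix described above, here is a minimal PySpark sketch. The tables `customers` and `orders`, their columns, and the helper `run_demo` are hypothetical names made up for this example, and the exact error message and behaviour may differ between Spark versions.

```python
# Hypothetical tables: customers(id, name) and orders(customer_id, amount).

# Typically rejected during analysis: nothing in the statement itself
# guarantees that the subquery yields at most one row per customer.
FAILING_QUERY = """
SELECT c.id,
       (SELECT o.amount FROM orders o WHERE o.customer_id = c.id) AS amount
FROM customers c
"""

# Accepted: an aggregate without GROUP BY always produces exactly one row,
# so the analyzer can verify the single-row property from the SQL alone.
FIXED_QUERY = """
SELECT c.id,
       (SELECT max(o.amount) FROM orders o WHERE o.customer_id = c.id) AS amount
FROM customers c
"""


def run_demo(spark):
    """Run both queries on an existing SparkSession (requires pyspark)."""
    # Imported lazily so the query strings can be inspected without Spark.
    from pyspark.sql.utils import AnalysisException

    # Register two tiny temp views (made-up sample data).
    spark.createDataFrame(
        [(1, "a"), (2, "b")], ["id", "name"]
    ).createOrReplaceTempView("customers")
    spark.createDataFrame(
        [(1, 10.0), (2, 20.0)], ["customer_id", "amount"]
    ).createOrReplaceTempView("orders")

    try:
        spark.sql(FAILING_QUERY).collect()
        rejected = False
    except AnalysisException:
        rejected = True  # the analyzer refused the non-aggregated subquery

    rows = spark.sql(FIXED_QUERY).orderBy("id").collect()
    return rejected, rows
```

Calling `run_demo(spark)` with an active `SparkSession` returns whether the first query was rejected, together with the rows of the aggregated version. If the subquery could in fact match several rows, the aggregate you pick decides which value you get, so choose it deliberately rather than only to silence the error.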
