Difference between === null and isNull in Spark DataDrame

First and foremost don’t use null in your Scala code unless you really have to for compatibility reasons.

Regarding your question it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, result is undefined for any value including NULL itself.

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+

The only valid methods to check for NULL are:

  • IS NULL:

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

implemented in DataFrame DSL as Column.isNull and Column.isNotNull respectively.

Note:

For NULL-safe comparisons use IS DISTINCT / IS NOT DISTINCT:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and DataFrame DSL respectively.

Related:

Including null values in an Apache Spark Join

Leave a Comment