Why does using a UDF in a SQL query lead to a Cartesian product?

Why does using UDFs lead to a Cartesian product instead of a full outer join?

The reason UDFs require a Cartesian product is quite simple. Since you pass an arbitrary function with a possibly infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. This means you simply have to check all possible pairs.
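To illustrate the point, here is a minimal plain-Python sketch (not Spark code) of a join whose condition is an opaque function. The helper `join_with_opaque_predicate` and the predicate `close_enough` are hypothetical names made up for this example. Because the join routine knows nothing about the predicate, it has to call it once per pair.

```python
from itertools import product


def join_with_opaque_predicate(left, right, predicate):
    """Join two lists of values using a predicate the caller cannot inspect.

    Returns the matching pairs and the number of predicate calls made.
    """
    calls = 0
    matches = []
    # Nothing is known about the predicate, so every pair is evaluated.
    for l_val, r_val in product(left, right):
        calls += 1
        if predicate(l_val, r_val):
            matches.append((l_val, r_val))
    return matches, calls


# Hypothetical UDF: values match when they differ by at most 1.
def close_enough(a, b):
    return abs(a - b) <= 1


left = [1, 5, 10]
right = [2, 6, 20, 30]
matches, calls = join_with_opaque_predicate(left, right, close_enough)
print(matches)  # pairs accepted by the predicate
print(calls)    # len(left) * len(right) evaluations
```

The number of calls is always the product of the two input sizes, regardless of how few pairs actually match.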

Simple equality, on the other hand, has predictable behavior. If you use a t1.foo = t2.bar condition, you can simply shuffle t1 and t2 rows by foo and bar respectively to get the expected result.
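As a rough in-memory analogue of that shuffle (an illustrative sketch, not how Spark implements it), rows can be grouped by key so that each row is only compared with rows that share its key. The function `hash_equi_join` and the sample columns `a` and `b` are assumptions made for this example; only `foo` and `bar` come from the condition above.

```python
from collections import defaultdict


def hash_equi_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dict rows on left_key = right_key."""
    # Group the right-hand rows by join key, similar in spirit to
    # sending rows with the same key to the same partition.
    buckets = defaultdict(list)
    for r_row in right:
        buckets[r_row[right_key]].append(r_row)

    matches = []
    for l_row in left:
        # Only rows in the matching bucket are candidates.
        for r_row in buckets.get(l_row[left_key], []):
            matches.append((l_row, r_row))
    return matches


t1 = [{"foo": 1, "a": "x"}, {"foo": 2, "a": "y"}, {"foo": 3, "a": "z"}]
t2 = [{"bar": 2, "b": "p"}, {"bar": 3, "b": "q"}, {"bar": 3, "b": "r"}]
result = hash_equi_join(t1, t2, "foo", "bar")
for l_row, r_row in result:
    print(l_row, r_row)
```

This works only because equality lets you derive a key from each side independently, which is exactly what an arbitrary UDF does not allow.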

And just to be precise: in relational algebra, an outer join is actually expressed in terms of a natural join. Anything beyond that is simply an optimization.
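For reference, one common textbook formulation of the full outer join in terms of the natural join is sketched below. The notation is an assumption of this sketch: $\omega_S$ stands for a tuple of nulls over the attributes that appear only in $S$, $\omega_R$ is the same for $R$, and $\pi_R$ projects onto the attributes of $R$.

```latex
R \;\text{FULL OUTER JOIN}\; S =
  (R \bowtie S)
  \;\cup\; \bigl( (R - \pi_{R}(R \bowtie S)) \times \{\omega_S\} \bigr)
  \;\cup\; \bigl( \{\omega_R\} \times (S - \pi_{S}(R \bowtie S)) \bigr)
```

In words: take the natural join, then add the unmatched rows from each side padded with nulls.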

Is there any way to force an outer join over the Cartesian product?

Not really, unless you want to modify the Spark SQL engine.
