What to do with null values when modeling and normalizing?

SQL treats NULL specially per its version of 3VL (3-valued logic). Normalization & other relational theory does not. However, we can translate SQL designs into relational designs and back. (Assume no duplicate rows here.)

Normalization happens to relations and is defined in terms of operators that don’t treat NULL specially. The term “normalization” has two most common distinct meanings: putting a table into “1NF” and into “higher NFs (normal forms)”. NULL doesn’t affect “normalization to 1NF”. “Normalization to higher NFs” replaces a table by smaller tables that natural join back to it. For purposes of normalization you could treat NULL like a value that is allowed in the domain of a nullable column in addition to the values of its SQL type. If our SQL tables have no NULLs then we can interpret them as relations & SQL join etc as join, etc. But if you decompose where a nullable column was shared between components then realize that to reconstruct the original in SQL you have to SQL join on same-named columns being equal or both NULL. And you won’t want such CKs (candidate keys) in an SQL database. Eg you can’t declare it as an SQL PK (primary key) because that means UNIQUE NOT NULL. Eg a UNIQUE constraint involving a nullable column allows multiple rows that have a NULL in that column, even if the rows have the same values in every column. Eg NULLs in SQL FKs cause them to be satisfied (in various ways per MATCH mode), not to fail from not appearing in the referenced table. (But DBMSs idiosyncratically differ from standard SQL.)

Unfortunately decomposition might lead to a table with all CKs containing NULL, so that we have nothing to declare as SQL PK or UNIQUE NOT NULL. The only sure solution is to convert to a NULL-free design. After then normalizing we might want to reintroduce some nullability in the components.

In practice, we manage to design tables so that there is always a set of NULL-free columns that we can declare as CK, via SQL PK or UNIQUE NOT NULL. Then we can get rid of a nullable column by dropping it from the table and adding a table with that column and the columns of some NULL-free CK: If the column is non-NULL for a row in the old design then a row with its CK subrow and column value go in the added table; otherwise it is NULL in the old design and no corresponding row is in the added table. (The original table is a natural left join of the new ones.) Of course, we also have to modify queries from the old design to the new design.

We can always avoid NULLs via a design that adds a boolean column for each old nullable column and has the old column NOT NULL. The new column says for a row whether the old column was NULL in the old design and when true has the old column be some one value that we pick for that purpose for that type throughout the database. Of course, we also have to modify queries from the old design to the new design.

Whether you want to avoid NULL is a separate question. Your database might in some way be “better” or “worse” for your application with either design. The idea behind avoiding NULL is that it complicates the meanings of queries, hence complicates querying, in a perverse way, compared to the complication of more joins from more NULL-free tables. (That perversity is typically managed by removing NULLs in query expressions as close to where they appear as possible.)

PS Many SQL terms including PK & FK differ from the relational terms. SQL PK means something more like superkey; SQL FK means something more like foreign superkey; but it doesn’t even make sense to talk about a “superkey” in SQL:

Because of the resemblance of SQL tables to relations, terms that involve relations get sloppily applied to tables. But although you can borrow terms and give them SQL meanings–value, table, FD (functional dependency), superkey, CK (candidate key), PK (primary key), FK (foreign key), join, and, predicate, NF (normal form), normalize, 1NF, etc–you can’t just substitute those SQL meanings for those words in RM definitions, theorems or algorithms and get something sensible or true. Moreover SQL presentations of RM notions almost never actually tell you how to soundly apply RM notions to an SQL database. They just parrot RM presentations, oblivious to whether their use of SQL meanings for terms makes things nonsensical or invalid.

Leave a Comment