Is there any rule of thumb to construct SQL query from a human-readable description?

Is there any systematic step-by-step or mathematical way to construct
SQL query from a given human-readable description?

Yes, there is.

It turns out that natural language expressions and logical expressions and relational algebra expressions and SQL expressions (a hybrid of the last two) correspond in a rather direct way. (What follows is for no duplicate rows & no nulls.)

Each table (base or query result) has an associated predicate–a natural language fill-in-the-(named-)blanks statement template parameterized by column names.

[liker] likes [liked]

A table holds every row that, using the row’s column values to fill in the (named) blanks, makes a true statement aka proposition. Here’s a table with that predicate & its rows’ propositions:

liker  | liked
--------------
Bob    | Dex    /* Bob likes Dex */
Bob    | Alice  /* Bob likes Alice */
Alice  | Carol  /* Alice likes Carol */

Each proposition from filling a predicate with the values from a row in a table is true. And each proposition from filling a predicate with the values from a row not in a table is false. Here’s what that table says:

/*
    Alice likes Carol
AND NOT Alice likes Alice
AND NOT Alice likes Bob
AND NOT Alice likes Dex
AND NOT Alice likes Ed
...
AND Bob likes Alice
AND Bob likes Dex
AND NOT Bob likes Bob
AND NOT Bob likes Carol
AND NOT Bob likes Ed
...
AND NOT Carol likes Alice
...
AND NOT Dex likes Alice
...
AND NOT Ed likes Alice
...
*/   

The DBA gives the predicate for each base table. The SQL syntax for a table declaration is a lot like the traditional logic shorthand for the natural language version of a given predicate. Here’s a declaration of a base table to hold our value:

/* (person, liked) rows where [liker] likes [liked] */
/* (person, liked) rows where Likes(liker, liked) */
CREATE TABLE Likes (
    liker ...,
    liked ...
);

An SQL query (sub)expression transforms argument table values to a new table value holding the rows that make a true statement from a new predicate. The new table predicate can be expressed in terms of the argument table predicate(s) according to the (sub)expression’s relational/table operators. A query is an SQL expression whose predicate is the predicate for the table of rows we want.

When we give a table & (possibly implicit) alias A to be joined, the operator acts on a value & predicate like the table’s but with columns renamed from C,... to A.C,.... Then

  • R , S & R CROSS JOIN S are rows where the predicate of R AND the predicate of S

  • R INNER JOIN S ON condition is rows where the predicate of R AND the predicate of S AND condition

  • R LEFT JOIN S ON condition is rows where (for S-only columns S1,…)

        the predicate of R AND the predicate of S AND condition
    OR
            the predicate of R
        AND NOT FOR SOME values for S1,... [the predicate of S AND condition]
        AND S1 IS NULL AND ...
    
  • R WHERE condition is rows where the predicate of R AND condition

  • SELECT DISTINCT A.C AS D,... FROM R (maybe with implicit A. and/or implicit AS D) is rows where

    • FOR SOME values for A.*,... [A.C=D AND ... AND the predicate of R] (This can be less compact but looks more like the SQL.)
    • if there are no dropped columns, the predicate of R with A.C,... replaced by D,...
    • if there are dropped columns, FOR SOME values for the dropped columns [ the predicate of R with A.C,... replaced by D,... ]
  • (X,...) IN (R) means

    • the predicate of R with columns C,... replaced by X,...
    • (X,...) IN R

Example: Natural language for (person, liked) rows where [person] is Bob and Bob likes someone who likes [liked] but who doesn’t like Ed:

/* (person, liked) rows where
FOR SOME value for x,
        [person] likes [x]
    and [x] likes [liked]
    and [person] = 'Bob'
    and not [x] likes 'Ed'
*/

Rewrite using shorthand predicates:

/* (person, liked) rows where
FOR SOME value for x,
        Likes(person, x)
    AND Likes(x, liked)
    AND person = 'Bob'
    AND NOT Likes(x, 'Ed')
*/

Rewrite using only shorthand predicates of base & aliased tables:

/* (person, liked) rows where
FOR SOME values for l1.*, l2.*,
        person = l1.liker AND liked = l2.liked
    AND Likes(l1.liker, l1.liked)
    AND Likes(l2.liker, l2.liked)
    AND l1.liked = l2.liker
    AND person = 'Bob'
    AND NOT (l1.liked, 'Ed') IN Likes
*/

Rewrite in SQL:

SELECT DISTINCT l1.liker AS person, l2.liked AS liked
    /* (l1.liker, l1.liked, l2.liker, l2.liked) rows where
        Likes(l1.liker, l1.liked)
    AND Likes(l2.liker, l2.liked)
    AND l1.liked = l2.liker
    AND l1.liker="Bob"
    AND NOT (l1.liked, 'Ed') IN Likes
    */
FROM Likes l1
INNER JOIN Likes l2
ON l1.liked = l2.liker
WHERE l1.liker="Bob"
AND NOT (l1.liked, 'Ed') IN (SELECT * FROM Likes)

Similarly,

  • R UNION CORRESPONDING S is rows where the predicate of R OR the predicate of R

  • R UNION S is rows where the predicate of R OR the predicate we get by replacing the columns of S by the columns of R in the predicate of R

  • VALUES (X,...), ... with columns C,... is rows where C = X AND ... OR ...

Example:

/* (person) rows where
    FOR SOME value for liked, Likes(person, liked)
OR  person = 'Bob'
*/

    SELECT liker AS person
    FROM Likes
UNION
    VALUES ('Bob')

So if we express our desired rows in terms of given base table natural language statement templates that rows make true or false (to be returned or not) then we can translate to SQL queries that are nestings of logic shorthands & operators and/or table names & operators. And then the DBMS can convert totally to tables to calculate the rows making our predicate true.

See How to get matching data from another SQL table for two different columns: Inner Join and/or Union? re applying this to SQL. (Another self-join.)
See Relational algebra for banking scenario for more on natural language phrasings. (In a relational algebra context.)
See Null in Relational Algebra for another presentation of relational querying.

Leave a Comment