subquery or leftjoin with group by which one is faster?

A great resource on calculating running totals in SQL Server is this document by Itzik Ben Gan that was submitted to the SQL Server Team as part of his campaign to have the OVER clause extended further from its initial SQL Server 2005 implementation. In it he shows how once you get into tens of thousands of rows cursors out perform set based solutions. SQL Server 2012 did indeed extend the OVER clause making this sort of query much easier.

SELECT col1,
       SUM(col1) OVER (ORDER BY ind ROWS UNBOUNDED PRECEDING)
FROM   @tmp 

As you are on SQL Server 2005 however this is not available to you.

Adam Machanic shows here how the CLR can be used to improve on the performance of standard TSQL cursors.

For this table definition

CREATE TABLE RunningTotals
(
ind int identity(1,1) primary key,
col1 int
)

I create tables with both 2,000 and 10,000 rows in a database with ALLOW_SNAPSHOT_ISOLATION ON and one with this setting off (The reason for this is because my initial results were in a DB with the setting on that led to a puzzling aspect of the results).

The clustered indexes for all tables just had 1 root page. The number of leaf pages for each is shown below.

+-------------------------------+-----------+------------+
|                               | 2,000 row | 10,000 row |
+-------------------------------+-----------+------------+
| ALLOW_SNAPSHOT_ISOLATION OFF  |         5 |         22 |
| ALLOW_SNAPSHOT_ISOLATION ON   |         8 |         39 |
+-------------------------------+-----------+------------+

I tested the following cases (Links show execution plans)

  1. Left Join and Group By
  2. Correlated subquery 2000 row plan,10000 row plan
  3. CTE from Mikael’s (updated) answer
  4. CTE below

The reason for inclusion of the additional CTE option was in order to provide a CTE solution that would still work if the ind column was not guaranteed sequential.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
DECLARE @col1 int, @sumcol1 bigint;

WITH    RecursiveCTE
AS      (
        SELECT TOP 1 ind, col1, CAST(col1 AS BIGINT) AS Total
        FROM RunningTotals
        ORDER BY ind
        UNION   ALL
        SELECT  R.ind, R.col1, R.Total
        FROM    (
                SELECT  T.*,
                        T.col1 + Total AS Total,
                        rn = ROW_NUMBER() OVER (ORDER BY T.ind)
                FROM    RunningTotals T
                JOIN    RecursiveCTE R
                        ON  R.ind < T.ind
                ) R
        WHERE   R.rn = 1
        )
SELECT  @col1 =col1, @sumcol1=Total
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0);

All of the queries had a CAST(col1 AS BIGINT) added in order to avoid overflow errors at runtime. Additionally for all of them I assigned the results to variables as above in order to eliminate time spent sending back results from consideration.

Results

+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+
|                  |          |        |          Base Table        |         Work Table         |     Time        |
+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+
|                  | Snapshot | Rows   | Scan count | logical reads | Scan count | logical reads | cpu   | elapsed |
| Group By         | On       | 2,000  | 2001       | 12709         |            |               | 1469  | 1250    |
|                  | On       | 10,000 | 10001      | 216678        |            |               | 30906 | 30963   |
|                  | Off      | 2,000  | 2001       | 9251          |            |               | 1140  | 1160    |
|                  | Off      | 10,000 | 10001      | 130089        |            |               | 29906 | 28306   |
+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+
| Sub Query        | On       | 2,000  | 2001       | 12709         |            |               | 844   | 823     |
|                  | On       | 10,000 | 2          | 82            | 10000      | 165025        | 24672 | 24535   |
|                  | Off      | 2,000  | 2001       | 9251          |            |               | 766   | 999     |
|                  | Off      | 10,000 | 2          | 48            | 10000      | 165025        | 25188 | 23880   |
+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+
| CTE No Gaps      | On       | 2,000  | 0          | 4002          | 2          | 12001         | 78    | 101     |
|                  | On       | 10,000 | 0          | 20002         | 2          | 60001         | 344   | 342     |
|                  | Off      | 2,000  | 0          | 4002          | 2          | 12001         | 62    | 253     |
|                  | Off      | 10,000 | 0          | 20002         | 2          | 60001         | 281   | 326     |
+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+
| CTE Alllows Gaps | On       | 2,000  | 2001       | 4009          | 2          | 12001         | 47    | 75      |
|                  | On       | 10,000 | 10001      | 20040         | 2          | 60001         | 312   | 413     |
|                  | Off      | 2,000  | 2001       | 4006          | 2          | 12001         | 94    | 90      |
|                  | Off      | 10,000 | 10001      | 20023         | 2          | 60001         | 313   | 349     |
+------------------+----------+--------+------------+---------------+------------+---------------+-------+---------+

Both the correlated subquery and the GROUP BY version use “triangular” nested loop joins driven by a clustered index scan on the RunningTotals table (T1) and, for each row returned by that scan, seeking back into the table (T2) self joining on T2.ind<=T1.ind.

This means that the same rows get processed repeatedly. When the T1.ind=1000 row is processed the self join retrieves and sums all rows with an ind <= 1000, then for the next row where T1.ind=1001 the same 1000 rows are retrieved again and summed along with one additional row and so on.

The total number of such operations for a 2,000 row table is 2,001,000, for 10k rows 50,005,000 or more generally (n² + n) / 2 which clearly grows exponentially.

In the 2,000 row case the main difference between the GROUP BY and the subquery versions is that the former has the stream aggregate after the join and so has three columns feeding into it (T1.ind, T2.col1, T2.col1) and a GROUP BY property of T1.ind whereas the latter is calculated as a scalar aggregate, with the stream aggregate before the join, only has T2.col1 feeding into it and has no GROUP BY property set at all. This simpler arrangement can be seen to have a measurable benefit in terms of reduced CPU time.

For the 10,000 row case there is an additional difference in the sub query plan. It adds an eager spool which copies all the ind,cast(col1 as bigint) values into tempdb. In the case that snapshot isolation is on this works out more compact than the clustered index structure and the net effect is to reduce the number of reads by about 25% (as the base table preserves quite a lot of empty space for versioning info), when this option is off it works out less compact (presumably due to the bigint vs int difference) and more reads result. This reduces the gap between the sub query and group by versions but the sub query still wins.

The clear winner however was the Recursive CTE. For the “no gaps” version logical reads from the base table are now 2 x (n + 1) reflecting the n index seeks into the 2 level index to retrieve all of the rows plus the additional one at the end that returns nothing and terminates the recursion. That still meant 20,002 reads to process a 22 page table however!

Logical work table reads for the recursive CTE version are very high. It seems to work out at 6 worktable reads per source row. These come from the index spool that stores the output of the previous row then is read from again in the next iteration (good explanation of this by Umachandar Jayachandran here). Despite the high number this is still the best performer.

Leave a Comment