How Can I Detect and Bound Changes Between Row Values in a SQL Table?

Question

Finding “ToTime” By Aggregates Instead of a Join

I would like to share a really wild query that only takes 1 scan of the table with 1 logical read. By comparison, the best other answer on the page, Simon Kingston’s query, takes 2 scans.

On a very large set of data (17,408 input rows, producing 8,193 result rows) it takes CPU 574 and time 2645, while Simon Kingston’s query takes CPU 63,820 and time 37,108.

It’s possible that with indexes the other queries on the page could perform many times better, but it is interesting to me to achieve 111x CPU improvement and 14x speed improvement just by rewriting the query.

(Please note: I mean no disrespect at all to Simon Kingston or anyone else; I am simply excited about my idea for this query panning out so well. His query is better than mine as its performance is plenty and it actually is understandable and maintainable, unlike mine.)

Here is the impossible query. It is hard to understand. It was hard to write. But it is awesome. 🙂

WITH Ranks AS (
   SELECT
      T = Dense_Rank() OVER (ORDER BY Time, Num),
      N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time, Num),
      *
   FROM
      #Data D
      CROSS JOIN (
         VALUES (1), (2)
      ) X (Num)
), Items AS (
   SELECT
      FromTime = Min(Time),
      ToTime = Max(Time),
      Name = IsNull(Min(CASE WHEN Num = 2 THEN Name END), Min(Name)),
      I = IsNull(Min(CASE WHEN Num = 2 THEN T - N END), Min(T - N)),
      MinNum = Min(Num)
   FROM
      Ranks
   GROUP BY
      T / 2
)
SELECT
   FromTime = Min(FromTime),
   ToTime = CASE WHEN MinNum = 2 THEN NULL ELSE Max(ToTime) END,
   Name
FROM Items
GROUP BY
   I, Name, MinNum
ORDER BY
   FromTime

Note: This requires SQL 2008 or up. To make it work in SQL 2005, change the VALUES clause to SELECT 1 UNION ALL SELECT 2.

Updated Query

After thinking about this a bit, I realized that I was accomplishing two separate logical tasks at the same time, and this made the query unnecessarily complicated: 1) prune out intermediate rows that have no bearing on the final solution (rows that do not begin a new task) and 2) pull the “ToTime” value from the next row. By performing #1 before #2, the query is simpler and performs with approximately half the CPU!

So here is the simplified query that first, trims out the rows we don’t care about, then gets the ToTime value using aggregates rather than a JOIN. Yes, it does have 3 windowing functions instead of 2, but ultimately because of the fewer rows (after pruning those we don’t care about) it has less work to do:

WITH Ranks AS (
   SELECT
      Grp =
         Row_Number() OVER (ORDER BY Time)
         - Row_Number() OVER (PARTITION BY Name ORDER BY Time),
      [Time], Name
   FROM #Data D
), Ranges AS (
   SELECT
      Result = Row_Number() OVER (ORDER BY Min(R.[Time]), X.Num) / 2,
      [Time] = Min(R.[Time]),
      R.Name, X.Num
   FROM
      Ranks R
      CROSS JOIN (VALUES (1), (2)) X (Num)
   GROUP BY
      R.Name, R.Grp, X.Num
)
SELECT
   FromTime = Min([Time]),
   ToTime = CASE WHEN Count(*) = 1 THEN NULL ELSE Max([Time]) END,
   Name = IsNull(Min(CASE WHEN Num = 2 THEN Name ELSE NULL END), Min(Name))
FROM Ranges R
WHERE Result > 0
GROUP BY Result
ORDER BY FromTime;

This updated query has all the same issues as I presented in my explanation, however, they are easier to solve because I am not dealing with the extra unneeded rows. I also see that the Row_Number() / 2 value of 0 I had to exclude, and I am not sure why I didn’t exclude it from the prior query, but in any case this works perfectly and is amazingly fast!

Outer Apply Tidies Things Up

Last, here is a version basically identical to Simon Kingston’s query that I think is an easier to understand syntax.

SELECT
   FromTime = Min(D.Time),
   X.ToTime,
   D.Name
FROM
   #Data D
   OUTER APPLY (
      SELECT TOP 1 ToTime = D2.[Time]
      FROM #Data D2
      WHERE
         D.[Time] < D2.[Time]
         AND D.[Name] <> D2.[Name]
      ORDER BY D2.[Time]
   ) X
GROUP BY
   X.ToTime,
   D.Name
ORDER BY
   FromTime;

Here’s the setup script if you want to do performance comparison on a larger data set:

CREATE TABLE #Data (
    RecordId int,
    [Time]  int,
    Name varchar(10)
);
INSERT #Data VALUES
    (1, 10, 'Running'),
    (2, 18, 'Running'),
    (3, 21, 'Running'),
    (4, 29, 'Walking'),
    (5, 33, 'Walking'),
    (6, 57, 'Running'),
    (7, 66, 'Running'),
    (8, 77, 'Running'),
    (9, 81, 'Walking'),
    (10, 89, 'Running'),
    (11, 93, 'Walking'),
    (12, 99, 'Running'),
    (13, 107, 'Running'),
    (14, 113, 'Walking'),
    (15, 124, 'Walking'),
    (16, 155, 'Walking'),
    (17, 178, 'Running');
GO
insert #data select recordid + (select max(recordid) from #data), time + (select max(time) +25 from #data), name from #data
GO 10

Explanation

Here is the basic idea behind my query.

The times that represent a switch have to appear in two adjacent rows, one to end the prior activity, and one to begin the next activity. The natural solution to this is a join so that an output row can pull from its own row (for the start time) and the next changed row (for the end time).
However, my query accomplishes the need to make end times appear in two different rows by repeating the row twice, with CROSS JOIN (VALUES (1), (2)). We now have all our rows duplicated. The idea is that instead of using a JOIN to do calculation across columns, we’ll use some form of aggregation to collapse each desired pair of rows into one.
The next task is to make each duplicate row split properly so that one instance goes with the prior pair and one with the next pair. This is accomplished with the T column, a ROW_NUMBER() ordered by Time, and then divided by 2 (though I changed it do a DENSE_RANK() for symmetry as in this case it returns the same value as ROW_NUMBER). For efficiency I performed the division in the next step so that the row number could be reused in another calculation (keep reading). Since row number starts at 1, and dividing by 2 implicitly converts to int, this has the effect of producing the sequence 0 1 1 2 2 3 3 4 4 ... which has the desired result: by grouping by this calculated value, since we also ordered by Num in the row number, we’ve now accomplished that all sets after the first one are comprised of a Num = 2 from the “prior” row, and a Num = 1 from the “next” row.
The next difficult task is figuring out a way to eliminate the rows we don’t care about and somehow collapse the start time of a block into the same row as the end time of a block. What we want is a way to get each discrete set of Running or Walking to be given its own number so we can group by it. DENSE_RANK() is a natural solution, but a problem is that it pays attention to each value in the ORDER BY clause–we don’t have syntax to do DENSE_RANK() OVER (PREORDER BY Time ORDER BY Name) so that the Time does not cause the RANK calculation to change except on each change in Name. After some thought I realized I could crib a bit from the logic behind Itzik Ben-Gan’s grouped islands solution, and I figured out that the rank of the rows ordered by Time, subtracted from the rank of the rows partitioned by Name and ordered by Time, would yield a value that was the same for each row in the same group but different from other groups. The generic grouped islands technique is to create two calculated values that both ascend in lockstep with the rows such as 4 5 6 and 1 2 3, that when subtracted will yield the same value (in this example case 3 3 3 as the result of 4 - 1, 5 - 2, and 6 - 3). Note: I initially started with ROW_NUMBER() for my N calculation but it wasn’t working. The correct answer was DENSE_RANK() though I am sorry to say I don’t remember why I concluded this at the time, and I would have to dive in again to figure it out. But anyway, that is what T-N calculates: a number that can be grouped on to isolate each “island” of one status (either Running or Walking).
But this was not the end because there are some wrinkles. First of all, the “next” row in each group contains the incorrect values for Name, N, and T. We get around this by selecting, from each group, the value from the Num = 2 row when it exists (but if it doesn’t, then we use the remaining value). This yields the expressions like CASE WHEN NUM = 2 THEN x END: this will properly weed out the incorrect “next” row values.
After some experimentation, I realized that it was not enough to group by T - N by itself, because both the Walking groups and the Running groups can have the same calculated value (in the case of my sample data provided up to 17, there are two T - N values of 6). But simply grouping by Name as well solves this problem. No group of either “Running” or “Walking” will have the same number of intervening values from the opposite type. That is, since the first group starts with “Running”, and there are two “Walking” rows intervening before the next “Running” group, then the value for N will be 2 less than the value for T in that next “Running” group. I just realized that one way to think about this is that the T - N calculation counts the number of rows before the current row that do NOT belong to the same value “Running” or “Walking”. Some thought will show that this is true: if we move on to the third “Running” group, it is only the third group by virtue of having a “Walking” group separating them, so it has a different number of intervening rows coming in before it, and due to it starting at a higher position, it is high enough so that the values cannot be duplicated.
Finally, since our final group consists of only one row (there is no end time and we need to display a NULL instead) I had to throw in a calculation that could be used to determine whether we had an end time or not. This is accomplished with the Min(Num) expression and then finally detecting that when the Min(Num) was 2 (meaning we did not have a “next” row) then display a NULL instead of the Max(ToTime) value.

I hope this explanation is of some use to people. I don’t know if my “row-multiplying” technique will be generally useful and applicable to most SQL query writers in production environments because of the difficulty understanding it and and the difficulty of maintenance it will most certainly present to the next person visiting the code (the reaction is probably “What on earth is it doing!?!” followed by a quick “Time to rewrite!”).

If you have made it this far then I thank you for your time and for indulging me in my little excursion into incredibly-fun-sql-puzzle-land.

See it For Yourself

A.k.a. simulating a “PREORDER BY”:

One last note. To see how T - N does the job–and noting that using this part of my method may not be generally applicable to the SQL community–run the following query against the first 17 rows of the sample data:

WITH Ranks AS (
   SELECT
      T = Dense_Rank() OVER (ORDER BY Time),
      N = Dense_Rank() OVER (PARTITION BY Name ORDER BY Time),
      *
   FROM
      #Data D
)
SELECT
   *,
   T - N
FROM Ranks
ORDER BY
   [Time];

This yields:

RecordId    Time Name       T    N    T - N
----------- ---- ---------- ---- ---- -----
1           10   Running    1    1    0
2           18   Running    2    2    0
3           21   Running    3    3    0
4           29   Walking    4    1    3
5           33   Walking    5    2    3
6           57   Running    6    4    2
7           66   Running    7    5    2
8           77   Running    8    6    2
9           81   Walking    9    3    6
10          89   Running    10   7    3
11          93   Walking    11   4    7
12          99   Running    12   8    4
13          107  Running    13   9    4
14          113  Walking    14   5    9
15          124  Walking    15   6    9
16          155  Walking    16   7    9
17          178  Running    17   10   7

The important part being that each group of “Walking” or “Running” has the same value for T - N that is distinct from any other group with the same name.

Performance

I don’t want to belabor the point about my query being faster than other people’s. However, given how striking the difference is (when there are no indexes) I wanted to show the numbers in a table format. This is a good technique when high performance of this kind of row-to-row correlation is needed.

Before each query ran, I used DBCC FREEPROCCACHE; DBCC DROPCLEANBUFFERS;. I set MAXDOP to 1 for each query to remove the time-collapsing effects of parallelism. I selected each result set into variables instead of returning them to the client so as to measure only performance and not client data transmission. All queries were given the same ORDER BY clauses. All tests used 17,408 input rows yielding 8,193 result rows.

No results are displayed for the following people/reasons:

RichardTheKiwi *Could not test--query needs updating*
ypercube       *No SQL 2012 environment yet :)*
Tim S          *Did not complete tests within 5 minutes*

With no index:

               CPU         Duration    Reads       Writes
               ----------- ----------- ----------- -----------
ErikE          344         344         99          0
Simon Kingston 68672       69582       549203      49

With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time);:

               CPU         Duration    Reads       Writes
               ----------- ----------- ----------- -----------
ErikE          328         336         99          0
Simon Kingston 70391       71291       549203      49          * basically not worse

With index CREATE UNIQUE CLUSTERED INDEX CI_#Data ON #Data (Time, Name);:

               CPU         Duration    Reads       Writes
               ----------- ----------- ----------- -----------
ErikE          375         414         359         0           * IO WINNER
Simon Kingston 172         189         38273       0           * CPU WINNER

So the moral of the story is:

Appropriate Indexes Are More Important Than Query Wizardry

With the appropriate index, Simon Kingston’s version wins overall, especially when including query complexity/maintainability.

Heed this lesson well! 38k reads is not really that many, and Simon Kingston’s version ran in half the time as mine. The speed increase of my query was entirely due to there being no index on the table, and the concomitant catastrophic cost this gave to any query needing a join (which mine didn’t): a full table scan Hash Match killing its performance. With an index, his query was able to do a Nested Loop with a clustered index seek (a.k.a. a bookmark lookup) which made things really fast.

It is interesting that a clustered index on Time alone was not enough. Even though Times were unique, meaning only one Name occurred per time, it still needed Name to be part of the index in order to utilize it properly.

Adding the clustered index to the table when full of data took under 1 second! Don’t neglect your indexes.

More Related Contents:

Leave a Comment Cancel reply