Getting data for histogram plot

This is a post about a super quick-and-dirty way to create a histogram
in MySQL for numeric values.

There are multiple other ways to create histograms that are better and
more flexible, using CASE statements and other types of complex logic.
This method wins me over time and time again since it’s just so easy
to modify for each use case, and so short and concise. This is how you
do it:

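SELECT ROUND(numeric_value, -2)    AS bucket,  -- round to the nearest 100
       COUNT(*)                    AS count,
       RPAD('', LN(COUNT(*)), '*') AS bar      -- one '*' per unit of ln(count)
FROM   my_table
GROUP  BY bucket
ORDER  BY bucket;  -- GROUP BY alone does not guarantee order in MySQL 8.0+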

Just change numeric_value to whatever your column is, change the
rounding increment (-2 rounds to the nearest 100, -1 to the nearest
10), and that’s it. The bars are on a logarithmic scale, so they don’t
grow too long when a bucket has a very large count.
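
To see what the logarithmic scale does to bar length, here is a small
sketch that feeds three counts from the sample output further down
through the same expression. The derived table sample_counts and the
bar_plus_one column are made up for illustration; the latter is just
one possible variant that gives a bucket with a single row a visible
bar.

```sql
-- Illustration only: no real table is needed, the counts are inlined.
SELECT n                        AS count,
       ROUND(LN(n))             AS stars,        -- 0, 4 and 15 for these inputs
       RPAD('', LN(n), '*')     AS bar,          -- same expression as the histogram
       RPAD('', 1 + LN(n), '*') AS bar_plus_one  -- variant: at least one '*'
FROM  (SELECT 1 AS n
       UNION ALL SELECT 52
       UNION ALL SELECT 5310766) AS sample_counts;
```

A count of 1 gives LN(1) = 0 and therefore an empty bar, which is why
the -500 bucket in the sample output has no stars.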

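Note that ROUND centers each bucket on its label, so if your data has
a hard lower bound, the first bucket is narrower than the rest. For
example, with ROUND(numeric_value, -1) on non-negative integers, the
values 0 to 4 (5 values) land in bucket 0, 5 to 14 (10 values) in
bucket 10, 15 to 24 in bucket 20, and so on. To give the first bucket
the same width as the others, offset the value by half the rounding
increment, e.g. ROUND(numeric_value - 5, -1). One caveat: a value
sitting exactly on the lower edge (0 here) becomes -5 after the
offset, and ROUND on an exact-value number rounds halves away from
zero, so that value lands in bucket -10 rather than 0.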
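
If the offset arithmetic feels fiddly, one alternative (not from the
original post, just a sketch under the same placeholder names) is to
bucket with FLOOR instead of ROUND. Each bucket is then labeled by its
lower edge and covers the same width, including on the negative side.
The width of 10 is an arbitrary choice for the example.

```sql
-- Sketch only: my_table and numeric_value are the placeholder names
-- used in the main query; 10 is an assumed bucket width.
SELECT FLOOR(numeric_value / 10) * 10 AS bucket,  -- lower edge of the bucket
       COUNT(*)                       AS count,
       RPAD('', LN(COUNT(*)), '*')    AS bar
FROM   my_table
GROUP  BY bucket
ORDER  BY bucket;
```

With this form, bucket 0 holds values from 0 up to (but not including)
10, and bucket -10 holds values from -10 up to (but not including) 0.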

This is an example of such a query on some random data that looks
pretty sweet. Good enough for a quick evaluation of the data.

+--------+----------+-----------------+
| bucket | count    | bar             |
+--------+----------+-----------------+
|   -500 |        1 |                 |
|   -400 |        2 | *               |
|   -300 |        2 | *               |
|   -200 |        9 | **              |
|   -100 |       52 | ****            |
|      0 |  5310766 | *************** |
|    100 |    20779 | **********      |
|    200 |     1865 | ********        |
|    300 |      527 | ******          |
|    400 |      170 | *****           |
|    500 |       79 | ****            |
|    600 |       63 | ****            |
|    700 |       35 | ****            |
|    800 |       14 | ***             |
|    900 |       15 | ***             |
|   1000 |        6 | **              |
|   1100 |        7 | **              |
|   1200 |        8 | **              |
|   1300 |        5 | **              |
|   1400 |        2 | *               |
|   1500 |        4 | *               |
+--------+----------+-----------------+

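Some notes: buckets with no matching rows do not appear in the output
at all, so you will never see a zero in the count column. Also, I’m
using the ROUND function here. You can just as easily replace it with
TRUNCATE if that makes more sense to you. Keep in mind that TRUNCATE
cuts toward zero, so when negative values are present, bucket 0 covers
both sides of zero and is wider than the others.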
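
If you do want empty buckets to show up with a zero count, one option
is to generate the list of buckets yourself and LEFT JOIN the counts
onto it. The sketch below assumes MySQL 8.0 or later (it relies on a
recursive common table expression) and hard-codes the range -500 to
1500 in steps of 100 to match the sample output; the names buckets,
counts and cnt are made up for this example.

```sql
-- Sketch only: requires MySQL 8.0+; range and step are assumptions.
WITH RECURSIVE buckets (bucket) AS (
    SELECT -500                       -- assumed lowest bucket
    UNION ALL
    SELECT bucket + 100               -- assumed bucket width
    FROM   buckets
    WHERE  bucket < 1500              -- assumed highest bucket
),
counts AS (
    SELECT ROUND(numeric_value, -2) AS bucket,
           COUNT(*)                 AS cnt
    FROM   my_table
    GROUP  BY bucket
)
SELECT b.bucket,
       COALESCE(c.cnt, 0)                    AS count,
       RPAD('', LN(COALESCE(c.cnt, 1)), '*') AS bar   -- LN(1) = 0, so an empty bar
FROM   buckets AS b
LEFT   JOIN counts AS c ON c.bucket = b.bucket
ORDER  BY b.bucket;
```

Rows that fall outside the generated range are silently dropped by
the join, so the range needs to cover your data.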

I found it here http://blog.shlomoid.com/2011/08/how-to-quickly-create-histogram-in.html
