Grouping by Arbitrary Time Ranges (Graphing What You Can See)

First, the back story. One of the MEM developers asked me today about an interesting problem:

We have graphs in MEM that can be generated against an arbitrary time range in the UI – even for a year or more if the data is not purged. Currently MEM does not do any kind of rolling up of the data (in an RRD style), and pulls graph data from each agent/instance on a 1 minute interval. So if you wanted to pull, for instance, the last month's worth of data into a graph, the server back end basically goes back to the database and requests all of the rows – all ~43,829 of them, oh, and that's for each series – calculates deltas on the server side if need be (we store raw values), and stuffs the data into a graphing library to draw the graph.

Further, graphs are only of a limited (but adjustable) size on the screen – I run MEM with a 900 pixel wide graph personally – so there are only ~900 points (maybe a few less, with the axes etc. being shown) that can be filled with data.

Trying to fill 900 points with ~43,000 data points is an exercise in futility. Damn those pixels. Why doesn't everybody have a cinema in their ops centers? Filling up memory with all that data on the app server side is another issue, too.

So we've been looking at various ways of a) rolling this data up in storage, and b) only selecting what can be shown on the screen if possible. The MEM dev hit me up about the latter today, and basically asked – "Given a time range, and a number of pixels, how can we group a set of rows together to get an aggregated result that only returns the same number of rows as we have pixels?"

Math to the rescue!

Luckily we store the timestamps for the data we collect in a BIGINT column – a unix style timestamp with millisecond resolution. If you're not doing this, then you should convert your DATETIME / TIMESTAMP etc. to an integer (with UNIX_TIMESTAMP) first.
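
For example, if you are starting from a DATETIME column, the conversion could look something like this (purely illustrative – the table and column names here are made up, and UNIX_TIMESTAMP() only has second resolution, so the * 1000 simply scales the value into the same millisecond units used below):

-- Hypothetical names: my_stats_table / collected_at
SELECT UNIX_TIMESTAMP(collected_at) * 1000 AS end_time_ms
  FROM my_stats_table;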

So I started off with the three known constants:

SET @start_time = 1254325343000; -- About a month ago
SET @end_time = 1256908959000;   -- About now
SET @pixels = 900;

Now, to get the interval that you need to group, you have to first get the full integer range you are dealing with, and then divide it by the number of pixels:

SET @intervals = ((@end_time - @start_time) / @pixels);
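
For the example values above that works out to (1256908959000 - 1254325343000) / 900 ≈ 2,870,684 ms per bucket – a touch under 48 minutes – which is why most of the groups in the sample output further down aggregate around 48 one-minute samples. A quick sanity check, if you want one:

SELECT (@end_time - @start_time) / @pixels           AS bucket_ms,
       ((@end_time - @start_time) / @pixels) / 60000 AS bucket_minutes;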

Now that we know the interval size that we are going to GROUP by, how do we group it? Math to the rescue again – to generate the time slices you divide each timestamp by the interval size, round the result down to a whole number of intervals, and then multiply it back up again, basically:

FLOOR( (timestamp_col) /  @intervals) *  @intervals
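
For illustration, you can see which bucket any single sample lands in – the literal timestamp here is just an arbitrary value I picked from within the range above:

SELECT FLOOR(1254325500000 / @intervals) * @intervals       AS bucket_start,
       (FLOOR(1254325500000 / @intervals) + 1) * @intervals AS bucket_end;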

To generate the range, I used a little CONCAT magic (and note that you add one before converting back for the end of the range), and then GROUP BY the output. Here's an example based on the MEM data set – which stores its timestamp in "end_time":

SELECT instance_attribute_id, 
       SUM(value) sum_value,
       MIN(value) min_value,
       MAX(value) max_value,
       AVG(value) avg_value,
       COUNT(value) agg_points,
       CONCAT( ROUND( FLOOR( (end_time) /  @intervals) *  @intervals), ' - ', 
               ROUND( (FLOOR( (end_time) /  @intervals ) + 1) *  @intervals)) AS time_range
  FROM dc_ng_long_now 
 WHERE instance_attribute_id IN (73, 76, 77, 79)  
   AND end_time BETWEEN @start_time AND @end_time
 GROUP BY instance_attribute_id, time_range

The above is what could be used to generate the CPU Usage graph (the instance IDs are each of the series) in the MEM UI, for a month of data. Some sample output:

+-----------------------+----------------+--------------+--------------+-------------------+------------+-------------------------------+
| instance_attribute_id | sum_value      | min_value    | max_value    | avg_value         | agg_points | time_range                    |
+-----------------------+----------------+--------------+--------------+-------------------+------------+-------------------------------+
|                    73 |   884176993340 | 442088260860 | 442088732480 | 442088496670.0000 |          2 | 1254322602524 - 1254325473209 | 
|                    73 | 21220811440010 | 442089201580 | 442111284920 | 442100238333.5417 |         48 | 1254325473209 - 1254328343893 | 
|                    73 | 21221895300670 | 442111747690 | 442133893930 | 442122818763.9583 |         48 | 1254328343893 - 1254331214578 | 
|                    73 | 21222981255310 | 442134365100 | 442156523250 | 442145442818.9583 |         48 | 1254331214578 - 1254334085262 | 
|                    73 | 20781887896940 | 442156993950 | 442178660650 | 442167827594.4681 |         47 | 1254334085262 - 1254336955947 | 
|                    73 | 21225129132600 | 442179133270 | 442201248030 | 442190190262.5000 |         48 | 1254336955947 - 1254339826631 | 
|                    73 | 21226213026150 | 442201720960 | 442223827720 | 442212771378.1250 |         48 | 1254339826631 - 1254342697316 | 
|                    73 | 21227293251850 | 442224296570 | 442246191720 | 442235276080.2083 |         48 | 1254342697316 - 1254345568000 | 
|                    73 | 21228371431870 | 442246663300 | 442268791290 | 442257738163.9583 |         48 | 1254345568000 - 1254348438684 | 
|                    73 | 21229455531730 | 442269262370 | 442291404770 | 442280323577.7083 |         48 | 1254348438684 - 1254351309369 | 
|                    73 | 20788226878750 | 442291875550 | 442313515300 | 442302699547.8723 |         47 | 1254351309369 - 1254354180053 | 
|                    73 | 21231576621020 | 442313990190 | 442333971060 | 442324512937.9167 |         48 | 1254354180053 - 1254357050738 | 
|                    73 | 21232474653460 | 442334346060 | 442352121880 | 442343221947.0833 |         48 | 1254357050738 - 1254359921422 | 
|                    73 | 21233360433480 | 442352500510 | 442371671120 | 442361675697.5000 |         48 | 1254359921422 - 1254362792107 | 
|                    73 | 21234392588940 | 442372144460 | 442394227610 | 442383178936.2500 |         48 | 1254362792107 - 1254365662791 | 
|                    73 | 21235472883050 | 442394698990 | 442416721710 | 442405685063.5417 |         48 | 1254365662791 - 1254368533476 | 
|                    73 | 21236557207050 | 442417196420 | 442439347780 | 442428275146.8750 |         48 | 1254368533476 - 1254371404160 | 
|                    73 | 20795177199970 | 442439809050 | 442461261420 | 442450578722.7660 |         47 | 1254371404160 - 1254374274844 | 
|                    73 | 21238693926960 | 442461733210 | 442483851610 | 442472790145.0000 |         48 | 1254374274844 - 1254377145529 | 
|                    73 | 21239778454900 | 442484323030 | 442506442450 | 442495384477.0833 |         48 | 1254377145529 - 1254380016213 | 
...

We can also see how many rows have actually been aggregated into each interval (scroll to the right) – so that we can decide whether there were enough points within an interval for the average to be good enough (the first and last intervals often won't be complete, and should perhaps be discarded) – and the interval that each row is computed for, i.e. "1254325473209 – 1254328343893".
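
If you wanted to push that "discard sparse buckets" decision down into SQL rather than handling it in the application (just a sketch of one option, not what MEM actually does), you could filter on the aggregated point count with a HAVING clause – with 1 minute samples you would expect roughly @intervals / 60000 points per complete bucket:

SELECT instance_attribute_id,
       AVG(value) avg_value,
       COUNT(value) agg_points,
       CONCAT( ROUND( FLOOR( (end_time) /  @intervals) *  @intervals), ' - ', 
               ROUND( (FLOOR( (end_time) /  @intervals ) + 1) *  @intervals)) AS time_range
  FROM dc_ng_long_now 
 WHERE instance_attribute_id IN (73, 76, 77, 79)  
   AND end_time BETWEEN @start_time AND @end_time
 GROUP BY instance_attribute_id, time_range
HAVING agg_points >= 0.75 * (@intervals / 60000); -- keep only buckets with at least ~75% of the expected samples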

And a little verification:


mysql> select count(*) 
    ->   from dc_ng_long_now 
    ->  where instance_attribute_id in (73, 76, 77, 79) 
    ->    and end_time between @start_time and @end_time;
+----------+
| count(*) |
+----------+
|   172240 | 
+----------+
1 row in set (0.12 sec)

mysql> select count(*) from(
    -> select instance_attribute_id, 
    ->        MIN(value),
    ->        MAX(value),
    ->        AVG(value),
    ->        CONCAT( ROUND(FLOOR( end_time /  @intervals) *  @intervals), ' - ', 
    ->                ROUND(( FLOOR( end_time /  @intervals ) + 1 ) *  @intervals)) AS time_range
    ->   from dc_ng_long_now 
    ->  where instance_attribute_id in (73, 76, 77, 79) 
    ->    and end_time between @start_time and @end_time
    ->  group by instance_attribute_id, time_range
    -> ) s1;
+----------+
| count(*) |
+----------+
|     3604 | 
+----------+
1 row in set (1.01 sec)

Note, I'm getting 4 sets of 901 rows – in the above case I'm getting roughly a 48 to 1 compression ratio for a month of data, going from 172,240 rows to 3,604!
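
And if you want to check the "4 sets of 901" part directly, counting the distinct buckets per series (again, just an extra verification query, not something MEM itself runs) should come back with roughly 901 for each of the four series:

SELECT instance_attribute_id,
       COUNT(DISTINCT FLOOR(end_time / @intervals)) AS buckets
  FROM dc_ng_long_now
 WHERE instance_attribute_id IN (73, 76, 77, 79)
   AND end_time BETWEEN @start_time AND @end_time
 GROUP BY instance_attribute_id;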

For MEM, there’s still work to do, but we’re on a roll now! Hopefully graph performance will be greatly improved in coming versions! 🙂

2 thoughts on “Grouping by Arbitrary Time Ranges (Graphing What You Can See)”

  1. First, it's clear that pulling data out of the database that you don't need is not helpful, so I think the idea of filtering the results into a smaller result set is good.

    You say: "Note, I'm getting 4 sets of 901 rows – in the above case I'm getting roughly a 48 to 1 compression ratio for a month of data, going from 172,240 rows to 3,604!". So if looking at 1 year of data (a "nice" period to look at) you are looking at getting the results from 2,000,000 rows of data.

    Also, one week's worth of data currently needs, IIRC, over 500 MB of data in the merlin db for a single agent's data. Extend those numbers to 1 year and a larger number of agents and it becomes clear that merlin currently requires a huge amount of storage for anything but a small setup. This really does need to be addressed, not only because of the data storage issues but also because doing what you suggest to collect a year's worth of data for several graphs is going to pull a lot of data out of the database.

    If it’s in cache you are fine, if not you are pulling data off disk and performance will degrade significantly.

    I have seen this with our current merlin instance but admittedly that is not using the “aggregation” mechanism you are talking about.

    One other thing which needs care with these graphs is producing data for "server group" information. It's a little too easy to select "All Servers" and end up almost doing a table scan, which is not a good thing to do.

    So I would suggest that you consider the situation from the user’s point of view:
    (1) do not take too long to fill the page with the various graphs
    (2) do not look at ALL data. Really, trust me, there's little point in scanning all the data, just take a "sample" when the stored resolution is higher. That should help performance quite a lot. This DOES mean that some data is lost, but that is fine, we accept that. Also remember the database may actually be quite busy. If sharing resources with the tomcat front-end then you don't have as much extra room for caching as might be desirable. Not everyone will be willing to split the front-end and back-end out to improve performance.
    (3) Look at the rollup. We DO want to look at 1 year’s worth of data for all our monitored instances and requiring 1 TB or more to store that is not a realistic proposition for any merlin user. It can be done, but will be hard to justify to management. Look at this part, I asked for it in merlin 1.2 and have been patiently waiting. Hopefully it will arrive before merlin version 12. Until this is resolved we HAVE to rely on other products (cacti) to tell us what our databases are doing. That’s rather unfortunate and you should be aiming to avoid that.

  2. Thank you for sharing this magnificent tip! I had been messing around with my queries for the last couple of days, trying to group rows by a certain time range.
    All I can say is it was starting to look very awful 🙂

    Thanks, this post made my day 😉
