I need to monitor some processes, for example the number of requests per minute to external services.
My first choice of a monitoring and graphing tool was rrdtool. It's a fairly natural choice, since there aren't many alternatives around.
So, the background: software X makes requests somewhere and increments a local counter. Every minute it tries to dump the value to rrdtool.
«Tries» is the key word, since the dump happens not via cron but in a low-priority thread.
The rrd database file is created with the following settings:
rrdtool create -s 60 testing.rrd DS:cnt:GAUGE:60:U:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:60:720 RRA:AVERAGE:0.5:1440:365
This all came from the internet/man pages/tutorials. Basically, I'm interested in each minute's request count (so I chose GAUGE, as the rrdtool homepage advises).
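For reference, here is my reading of that create command, annotated per the rrdtool documentation (note that the heartbeat equals the step here, so an update arriving more than 60 seconds after the previous one marks the interval as UNKNOWN):

```shell
# --step 60                : one primary data point expected every 60 seconds
# DS:cnt:GAUGE:60:U:U      : data source «cnt» of type GAUGE; heartbeat 60 s
#                            (gaps longer than 60 s become UNKNOWN); no min/max
# RRA:AVERAGE:0.5:1:1440   : per-step (1-minute) averages, 1440 rows = 1 day;
#                            0.5 is the xff: up to half the points may be unknown
# RRA:AVERAGE:0.5:60:720   : 60-step (hourly) averages, 720 rows = 30 days
# RRA:AVERAGE:0.5:1440:365 : 1440-step (daily) averages, 365 rows = 1 year
rrdtool create -s 60 testing.rrd \
    DS:cnt:GAUGE:60:U:U \
    RRA:AVERAGE:0.5:1:1440 \
    RRA:AVERAGE:0.5:60:720 \
    RRA:AVERAGE:0.5:1440:365
```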
The second part is the data dump. It happens not exactly every minute, but roughly every 60000 ms plus whatever time the «rrdupdate» call takes; in practice, about every 60500–61000 ms.
Seems reasonable, doesn't it? Over an hour I'll get maybe 59 data points instead of 60.
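That expectation is easy to sanity-check with shell arithmetic (a trivial check, but it pins down the numbers):

```shell
# One update every ~61 s instead of every 60 s:
# integer division gives the number of updates that fit into an hour.
updates_per_hour=$((3600 / 61))
echo "$updates_per_hour"   # prints 59
```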
But instead I got some strange rounded values in the rrd database…
To demonstrate the problem, I created an empty file with the command above and ran the following command in a loop for a while:
rrdtool update testing.rrd N:5
This closely mimics the behavior of my software: every minute the cnt value is 5. Since the update doesn't happen exactly every 60 seconds, I can accept that the database will hold a value close to 5.0 but slightly less.
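The loop I ran was, in essence, the following (a sketch; the real software updates from a background thread rather than a shell loop):

```shell
#!/bin/sh
# Feed a constant value of 5 roughly once a minute, with the slight
# overshoot described above, using the current time («N:») as the timestamp.
while true; do
    rrdtool update testing.rrd N:5
    sleep 61    # a bit more than the 60 s step, as in the real software
done
```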
What I got instead is shown in this graph:
The funny thing is that GAUGE is supposed to be used (as the site recommends) when you measure some parameter, like temperature, yet it fails to handle irregular updates.
I could use ABSOLUTE, with hacks: I'd need to multiply the value from rrd by 60 when plotting the graph. Also, if I change the update period from 60 seconds to some other value, I'll get a garbage data set that doesn't represent anything useful.
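For the record, the ABSOLUTE workaround at graph time would look roughly like this (a sketch, assuming the data source were recreated with type ABSOLUTE; the DEF/CDEF names are mine):

```shell
# ABSOLUTE stores a per-second rate, so multiply by 60 at graph time
# to recover the per-minute request count.
rrdtool graph requests.png \
    --start -1h \
    DEF:rate=testing.rrd:cnt:AVERAGE \
    CDEF:perminute=rate,60,* \
    LINE1:perminute#0000FF:"requests/min"
```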
The question is: is there any way to record the data so that what ends up in the rrd stays close to the actual values, using «N:» updates, or does this have to be solved with ugly homegrown workarounds at every usage site?
Or am I missing something? I'd appreciate any tips and comments.
P.S. I recreated the data file with a bigger heartbeat of 1800; here's the graph:
P.P.S. The culprit turned out to be an old, forgotten script that ran every minute and simply wrote «0» into every data file in the given directory!