Rrdtool's strange rounding produces seesaw graphs

 

I need to monitor some processes: for example, the number of requests per minute to external services.

My first choice for a monitoring and graphing tool was rrdtool. It's a pretty simple choice, since there aren't many alternatives around.

So, the background: software X makes requests somewhere and increments a local counter. Every minute it tries to dump the value to rrdtool.

«Tries» is the keyword, since the dump happens not via cron, but in a low-priority thread.

The rrd database file is created with these settings:

rrdtool create -s 60 testing.rrd DS:cnt:GAUGE:60:U:U RRA:AVERAGE:0.5:1:1440 RRA:AVERAGE:0.5:60:720 RRA:AVERAGE:0.5:1440:365

This all came from the internet/man pages/tutorials. Basically, I'm interested in each minute's request count (so I chose GAUGE, as the rrdtool homepage advises).

The second part is the data dump. It happens not exactly every minute, but roughly every 60000 ms, plus some time to call «rrdupdate». So in practice it probably happens every 60500–61000 milliseconds.

Seems reasonable, doesn't it? Per hour I'll get maybe 59 data points instead of 60.

But instead I got some strangely rounded values in the rrd database…

To demonstrate the problem, I created an empty file with the command above and ran the following commands in a loop for a while:

while true; do
    rrdtool update testing.rrd N:5
    sleep 61
done

This is quite similar to the behavior of my software: every minute the cnt value is 5. Since updates don't come exactly every 60 seconds, I can accept that the database will contain values close to 5.0 but a bit less.
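To check what actually lands in the database, the stored values can be dumped directly; a quick way to do it, using the file name from the create command above (the ten-minute window is arbitrary):

```
rrdtool fetch testing.rrd AVERAGE --start -600
```

This prints one timestamped row per 60-second step, so the rounded/averaged values are visible without plotting anything.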

What I got instead you can see in this graph:

rrd_seesaw

This is a very bad graph for me. I could deal with 59 records of 5.0 and one missing, or I could live with 60 values of 4.91, but I really need to see trends and dynamics…

The funny thing is that GAUGE should be used (as the site recommends) when you measure some parameter, like temperature, yet it fails to deal with irregular updates.

I could use ABSOLUTE, with hacks: I'd need to multiply the value from rrd by 60 when plotting the graph. Also, if I change the update period from 60 seconds to some other value, I'll get a garbage data set that doesn't represent anything useful.
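The multiply-by-60 hack can be done at graph time with a CDEF; a sketch, assuming the file and DS name from the create command above (the output name and color are arbitrary):

```
rrdtool graph requests.png \
    DEF:rate=testing.rrd:cnt:AVERAGE \
    CDEF:perminute=rate,60,* \
    LINE1:perminute#0000FF:"requests/min"
```

With ABSOLUTE, rrdtool stores a per-second rate, so the CDEF scales it back to the per-minute count; but as noted, this breaks as soon as the update period stops being 60 seconds.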

The question is: is there any way to record data that stays close to the actual values in rrd, using «N:» updates, or does this problem have to be solved with ad-hoc hacks in every case?

Or am I missing something? I'd appreciate any tips or comments.

 

P.S. I recreated the data file with a bigger heartbeat = 1800; here's the graph:

testing.rrd-avg-hb1800

P.P.S. The culprit turned out to be an old, forgotten script that ticked every minute and simply wrote «0» to all data files in a given directory!

Bind99 DNS amplification attack workaround

If you happen to get hit by a DNS spoofing attack, there's an easy way to mitigate it.

Maybe it won't work in your case, but it works just fine in mine.

1. What exactly is a «DNS amplification attack»? You can read about it on Google, but in simple words: your DNS server is used to flood some IP. Since you know which IP is targeted (it's specified as the spoofed request source), you can simply block it. Yes, including its legitimate traffic, but that's not as bad as it seems.

2. How can you prevent it? The simple answer: you almost can't. Yes, you can rate-limit requests, you can slip answers, but those are half measures.
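For completeness, that rate-limit/slip half measure lives in named.conf as the Response Rate Limiting (RRL) feature, available in BIND 9.9 as a build option and standard in later releases; a minimal sketch (the numbers are arbitrary examples):

```
options {
    rate-limit {
        responses-per-second 5;
        slip 2;
    };
};
```

This caps identical responses per client network and answers every second dropped query with a truncated (slip) reply, which blunts the amplification but doesn't stop the queries from arriving.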

3. Okay, the prerequisites (named, ipfw, cron):

  • Make sure you have some rule in ipfw numbered lower than 10000 that allows all your legitimate traffic!
  • enable logging in named.conf, at least at the queries level (if you change settings that influence the count and ordering of columns, make sure you tweak the parser itself, e.g. change $5 to something that works for you):

    logging {
        category "queries" { "debug"; };
        channel "debug" {
            file "/var/log/named.log" versions 3 size 10m;
            print-time yes;
            print-category yes;
        };
    };

    You don't need a big log file for this purpose; 10m seems reasonable.

  • create a parser script that gathers every new IP from named.log (I think all queries containing «IN ANY» should be blocked; maybe draconian, but it works):

    #!/bin/sh

    pre="/root/"
    /usr/bin/touch ${pre}allIps   # make sure the blocklist exists on the first run
    /usr/bin/grep "IN ANY" /var/named/var/log/named.log* | /usr/bin/grep -v "127.0.0.1" > ${pre}fullLog
    /usr/bin/awk '{print $5}' ${pre}fullLog | /usr/bin/awk -F"#" '{print $1}' | /usr/bin/sort | /usr/bin/uniq > ${pre}newIps
    /bin/cat ${pre}allIps >> ${pre}newIps
    /bin/cat ${pre}newIps | /usr/bin/sort | /usr/bin/uniq > ${pre}allIps
    /sbin/ipfw delete 10000
    for i in `/bin/cat ${pre}allIps` ; do
        /sbin/ipfw add 10000 deny all from $i to me
    done

  • put your script in cron; every few minutes will do.
  • eventually, fully clear your blocklist: add to cron something like «every hour (or day, or month), clear /root/allIps».
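The last two steps could look like this in /etc/crontab (the script path /root/blockany.sh is a made-up example; the five-minute and daily periods are arbitrary):

```
# run the parser/blocker every 5 minutes
*/5 *   *   *   *   root    /bin/sh /root/blockany.sh
# wipe the accumulated blocklist once a day at midnight
0   0   *   *   *   root    /bin/cp /dev/null /root/allIps
```

The system crontab format shown here includes the user column; in a per-user crontab (crontab -e) that column is omitted.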

 

That's it: you've kinda made this attack useless.