Tuesday, March 25, 2008

Temporarily speed up SANCP insertions in Sguil

It's Monday morning, you're half asleep, you haven't finished your first diet soda yet, and -- oh no! Sguild has been down all weekend! Worse yet, SANCP inserts are backed up, to the tune of 17,000+ files in the queue!

As you know, the Sguil sensors are pretty much independent of the actual Sguil server. They'll happily continue collecting data, even when sguild has been down for a while. When the server comes back up, the sensors will automagically reconnect and send all the queued data. This is by design, of course. You don't want to lose all that data due to a failure of the central server.

Even after an extended outage, most of the data collected by the Sguil sensors poses no real problem. There are relatively few Snort alerts (maybe a few thousand), and probably even fewer PADS events, and these get added to the database in no time. Network session records collected by SANCP, however, can pose a bigger problem.

If you recall, SANCP works by keeping an in-memory list of active network "sessions" (including pseudo-sessions created from UDP and ICMP traffic). By default, it dumps these to a file every minute or so (or more often, on a busy network). The Sguil sensor includes a SANCP agent process that monitors the filesystem for these files and sends them to Sguild as they are created, deleting each file from the sensor once it has been sent.

Now here's the problem: there are just so many darned network sessions on a busy network that even a short outage can result in a few hundred files waiting in the queue, especially if you have multiple sensors. Longer outages, though, can be disastrous. Let's say you have six sensors, and your Sguil server has been down for the weekend (48 hours). How many files is that?

60 files/hour * 48 hours * 6 sensors = 17,280 files

Now, at an average rate of about 5 seconds to insert each file, how many hours would it take to catch up?
17,280 files * 5 seconds/file / (60 * 60) = 24 hours

That's right! It'd take a full 24 hours to catch up! In the meantime, you're missing a few days of valuable network data (probably the very days you're most likely to want to query on Monday morning), and your MySQL database is spending all its time inserting, which means it's not only slower to respond to your analyst console, but also slower to process incoming events. In fact, it can easily get caught in a sharp downward spiral, where the incoming data gets even further backed up.
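
Incidentally, if you want to see just how far behind you really are, a quick count of the queued files on each sensor will tell you. The path below is only a guess based on a typical Sguil sensor layout, so substitute your own SANCP spool directory:
ls /nsm/sensor_data/$SENSORNAME/sancp/ | wc -l   # example path only; adjust for your install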

So what can you do about this? Actually, it's quite simple. If you find that you're getting behind while processing your backlog of SANCP records, you can dramatically speed things up by temporarily disabling the indices on your SANCP tables.

First, figure out which days you have to catch up on. If you know your server crashed on Friday the 8th and it's now Monday the 11th, you probably want to go through all SANCP tables from Friday through Monday.

Second, determine what the table names will be. Remember that Sguil creates one SANCP table per day, per sensor. These are all merged into a single virtual table, but for indexing purposes, ignore that one and concentrate on the individual tables. They will be named something like:
sancp_$SENSORNAME_$DATE

So for example, if you have two sensors named "external" and "internal", you'd have the following tables:
sancp_external_20080208
sancp_internal_20080208

sancp_external_20080209
sancp_internal_20080209

sancp_external_20080210
sancp_internal_20080210

sancp_external_20080211
sancp_internal_20080211
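
If you'd rather not work out the names by hand, you can also just ask MySQL to list its SANCP tables and pick out the days you need (this assumes you're at the mysql prompt in the same sguildb database used below):
SHOW TABLES LIKE 'sancp%';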


Next, you simply issue the SQL command to disable indexing for each table:
ALTER TABLE sancp_external_20080208 DISABLE KEYS;

MySQL will perform a quick table check before returning to the prompt. This may take a minute, and I personally find it annoying to wait after each table, so I usually just create a text file with all the commands in it, one per line, and run it in batch mode:
mysql -u sguil -p sguildb < DISABLE-KEYS.txt
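
If you don't feel like typing out all those ALTER statements by hand, here's a rough sketch of one way to generate the file. It assumes the same sguil user and sguildb database as above, plus the example dates from this post, so adjust the grep pattern to match your own outage window:
# sketch only: builds DISABLE-KEYS.txt for the 2008-02-08 through 2008-02-11 tables
mysql -N -u sguil -p sguildb -e "SHOW TABLES LIKE 'sancp%'" \
  | grep -E '2008020[89]|2008021[01]' \
  | awk '{print "ALTER TABLE " $1 " DISABLE KEYS;"}' > DISABLE-KEYS.txt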

Based on my experience, I've seen the insert speed go from about 5 seconds per file to about 5 files per second, which is quite significant! At that rate, it would take less than an hour to insert everything!
17,280 files / (5 files/second * 60 * 60) = 0.96 hours

Of course, you have to be extra careful to re-enable indices on all those tables. You can run a similar set of SQL commands to turn indices back on for a table:
ALTER TABLE sancp_external_20080208 ENABLE KEYS;

Again, I usually run this as a batch job.
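
If you built DISABLE-KEYS.txt as sketched above, a simple sed substitution will give you the matching re-enable file (ENABLE-KEYS.txt is just an example name):
# sketch only: derive the ENABLE script from the DISABLE script
sed 's/DISABLE KEYS/ENABLE KEYS/' DISABLE-KEYS.txt > ENABLE-KEYS.txt
mysql -u sguil -p sguildb < ENABLE-KEYS.txt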

Disabling and then later re-enabling the indices does take a little while, but usually not more than a few minutes each. Even given this overhead, it is still significantly faster to process a bunch of SANCP files without indices and then rebuild the indices once you're all caught up.

Sure wish I didn't need to know this... 8-)

Update 2008-03-25 11:27: After you re-enable keys, you may also need to do a quick db check to make everything sane again:
mysqlcheck -o -a -u sguil -p sguildb

This will recheck all your tables and make sure they're still consistent. I've had a few situations where Sguil has been returning error messages like "ERROR 1030 (HY000): Got error 124 from storage engine" until I did this.
