Thursday, April 03, 2008

PCAP Indexing

There's been some talk inside the Sguil project lately of improving the performance of PCAP data retrieval. That is, when you're trying to fetch the PCAP data for a single session out of a file that contains all the data your sensor saw during that time period, things can get pretty slow.

Our current method involves simply identifying which file probably (note the strategic use of that word) contains the data you're looking for, then using tcpdump with a BPF filter to find it. This usually works well, but it's often very slow, especially if the PCAP file you're searching through is large, say a few GB.

We've discussed a few approaches we could take to improve the performance of these retrievals. One promising way involves creating an index of the sessions inside each PCAP file. It turns out that the tool we're using to collect network session data, SANCP is actually a pretty decent packet logger, even though we're not using it in that manner. The newest 1.6.2 release candidate includes support for PCAP indexing, so I thought I'd take it for a spin and see just what kind of performance improvement we could expect, if any.

My good friend Geek00L recently blogged his experience with SANCP's indexing feature, but he didn't get into the performance, he just wrote about how it worked. You probably should read that, as well as the official docs, if you're interested in this subject.

Another thing I was interested in was figuring out how to use SANCP to index existing PCAP files, as Sguil is already capturing these with Snort. By default, SANCP will only index PCAP files that it creates, usually by sniffing the network interface. I prefer to keep my data collection processes separate if I can, and the existing PCAP collection is working well enough for now. So my first goal was to see if I could convince SANCP to create an index without creating an entirely new PCAP file.

It turns out that this is possible, though kinda kludgey. I was able to use the following ''sancp-pcapindex.conf'' file to create an index:

default index log
default pcap filename /dev/null
format index delimiter=| sancp_id,output_filename,\

This is pretty close to the version listed in SANCP's docs, except that I added the ''default pcap filename /dev/null'' line in there. So SANCP will still create a PCAP file, but it'll be written to /dev/null so I'll never see it.

I also had to use two additional command-line options to turn off the "realtime" and "stats" output SANCP likes to generate by default. so in the final analysis, here's the command line I ended up using:

sancp -r snort.log.12347263 -c sancp-pcapindex.conf -d sancp-output -R -S

Ok, so on to the actual indexing tests! I was curious about several things:

  1. How long does it take to create an index?
  2. How large is the index, compared to the size of the PCAP file itself?
  3. Is there a performance increase by using the index, and if so, how much?
  4. Does extracting PCAP by index return different data than extracting it by tcpdump and a BPF?

I decided that I would choose two PCAP files of different sizes for my tests. One file was 2.9GB, and the other was 9.5GB. For each file, I tested index creation speed and retrieval speed using tcpdump compared to retrieval speed using the index and SANCP's '''' tool. Each of these tests was conducted three times, and the results averaged to form the final result. In addition, I examined the size of the index file once (the last time), under the assumption that a properly-created index would be the same each time it was generated.

PCAP size Index size Index Creation Tcpdump extraction Indexed extraction
2.9GB 446M 7m58s 39s 5s
9.5GB 1300MB 23m20s 2m21s 11s

The larger file is roughly (very roughly) three times the size of the smaller file. As this data suggests, the index size and the index creation time are linear, as the larger of each value is about 3x the size of the smaller. The same is true of the time necessary to extract the data with Tcpdump, though not to quite the same extent (it's just a bit over 3x).

However, the interesting part is that the time required to extract the data using the indices is not linear. It only took a little over 2x as long, though to be honest, it could easily be a matter of the amount of data that was contained in the individual network session or something. Still, using the index was about 87% faster in the small file, and about 92% faster with the larger file.

I think it's pretty clear that these indices speed up PCAP retrieval substantially. I think the drawbacks are that the index files are fairly large, and that they take a long time to generate.

As for the index file size, the indices look like they are about 13% - 15% of the size of the original file. For the drastic performance improvement they provide, this could be worth it. What's one more drive in the RAID array? Also, it's possible that they could be compressed, with maybe only a relatively small impact on retrieval speed. I'll have to try that out.

Index generation time is potentially more serious. Obviously, it'd be nicer to generate the index at the same time the PCAP is originally written, but as I'm unwilling to do that (for the moment, at least), I think the obvious speed-up would be to somehow allow SANCP to generate the indices without trying to write a new PCAP file. Even when I've directed it to /dev/null, there has to be some performance overhead here, and any time spent writing PCAP we're throwing away is just time wasted. This would be my first choice for future work: make a good, quick index for an existing PCAP file.

All in all, I'm impressed with the retrieval speed of SANCP's indexed PCAP. Now, if we can get the index creation issue sorted out, this could be a really great addition to Sguil!

Update 2008-04-03 20:15: I was in such a rush to complete this post, that I accidentally forgot to answer my question #4! It turns out that the data returned by both extraction methods is exactly the same. Even the MD5 checksums of the PCAP files match. Great!

Also, note that I edited the above to add details on the command line I used to generate the indices. Another stupid rush mistake on my part. Sorry!


Brazilian said...

Hi, did you have any progress on this issue ? I am also looking for a quick way to index and then extract TCP flows with its payload after a keywork-based search.
The size of each pcap would be around 2-3GB each... best if I could search into a folder with many pcap files... Iknow it will be slow, just need to be sure it there's no other way.. :-)

David Bianco said...

Brazilian, I committed code to Sguil shortly after I wrote this post to capture PCAP and do indexing at the same time. It pretty much works as you describe, and is very fast. You pay a disk penalty for the index, but it's quite fast even for large datasets. Also, it gets around one of Sguil's traditional problems retrieving PCAP for sessions that span more than one file.

Ashish said...

For Advanced PCAP indexing you can use PCAP2XML tool for converting your PCAP into XML or SQlite and according to your need you can make changes into it. Have a look if you are interested:

Tool: -
Tool Blog: -