Thursday, April 03, 2008

PCAP Indexing

There's been some talk inside the Sguil project lately of improving the performance of PCAP data retrieval. That is, when you're trying to fetch the PCAP data for a single session out of a file that contains all the data your sensor saw during that time period, things can get pretty slow.

Our current method involves simply identifying which file probably (note the strategic use of that word) contains the data you're looking for, then using tcpdump with a BPF filter to find it. This usually works well, but it's often very slow, especially if the PCAP file you're searching through is large, say a few GB.

We've discussed a few approaches we could take to improve the performance of these retrievals. One promising way involves creating an index of the sessions inside each PCAP file. It turns out that the tool we're using to collect network session data, SANCP is actually a pretty decent packet logger, even though we're not using it in that manner. The newest 1.6.2 release candidate includes support for PCAP indexing, so I thought I'd take it for a spin and see just what kind of performance improvement we could expect, if any.

My good friend Geek00L recently blogged his experience with SANCP's indexing feature, but he didn't get into the performance, he just wrote about how it worked. You probably should read that, as well as the official docs, if you're interested in this subject.

Another thing I was interested in was figuring out how to use SANCP to index existing PCAP files, as Sguil is already capturing these with Snort. By default, SANCP will only index PCAP files that it creates, usually by sniffing the network interface. I prefer to keep my data collection processes separate if I can, and the existing PCAP collection is working well enough for now. So my first goal was to see if I could convince SANCP to create an index without creating an entirely new PCAP file.

It turns out that this is possible, though kinda kludgey. I was able to use the following ''sancp-pcapindex.conf'' file to create an index:

default index log
default pcap filename /dev/null
format index delimiter=| sancp_id,output_filename,\

This is pretty close to the version listed in SANCP's docs, except that I added the ''default pcap filename /dev/null'' line in there. So SANCP will still create a PCAP file, but it'll be written to /dev/null so I'll never see it.

I also had to use two additional command-line options to turn off the "realtime" and "stats" output SANCP likes to generate by default. so in the final analysis, here's the command line I ended up using:

sancp -r snort.log.12347263 -c sancp-pcapindex.conf -d sancp-output -R -S

Ok, so on to the actual indexing tests! I was curious about several things:

  1. How long does it take to create an index?
  2. How large is the index, compared to the size of the PCAP file itself?
  3. Is there a performance increase by using the index, and if so, how much?
  4. Does extracting PCAP by index return different data than extracting it by tcpdump and a BPF?

I decided that I would choose two PCAP files of different sizes for my tests. One file was 2.9GB, and the other was 9.5GB. For each file, I tested index creation speed and retrieval speed using tcpdump compared to retrieval speed using the index and SANCP's '''' tool. Each of these tests was conducted three times, and the results averaged to form the final result. In addition, I examined the size of the index file once (the last time), under the assumption that a properly-created index would be the same each time it was generated.

PCAP size Index size Index Creation Tcpdump extraction Indexed extraction
2.9GB 446M 7m58s 39s 5s
9.5GB 1300MB 23m20s 2m21s 11s

The larger file is roughly (very roughly) three times the size of the smaller file. As this data suggests, the index size and the index creation time are linear, as the larger of each value is about 3x the size of the smaller. The same is true of the time necessary to extract the data with Tcpdump, though not to quite the same extent (it's just a bit over 3x).

However, the interesting part is that the time required to extract the data using the indices is not linear. It only took a little over 2x as long, though to be honest, it could easily be a matter of the amount of data that was contained in the individual network session or something. Still, using the index was about 87% faster in the small file, and about 92% faster with the larger file.

I think it's pretty clear that these indices speed up PCAP retrieval substantially. I think the drawbacks are that the index files are fairly large, and that they take a long time to generate.

As for the index file size, the indices look like they are about 13% - 15% of the size of the original file. For the drastic performance improvement they provide, this could be worth it. What's one more drive in the RAID array? Also, it's possible that they could be compressed, with maybe only a relatively small impact on retrieval speed. I'll have to try that out.

Index generation time is potentially more serious. Obviously, it'd be nicer to generate the index at the same time the PCAP is originally written, but as I'm unwilling to do that (for the moment, at least), I think the obvious speed-up would be to somehow allow SANCP to generate the indices without trying to write a new PCAP file. Even when I've directed it to /dev/null, there has to be some performance overhead here, and any time spent writing PCAP we're throwing away is just time wasted. This would be my first choice for future work: make a good, quick index for an existing PCAP file.

All in all, I'm impressed with the retrieval speed of SANCP's indexed PCAP. Now, if we can get the index creation issue sorted out, this could be a really great addition to Sguil!

Update 2008-04-03 20:15: I was in such a rush to complete this post, that I accidentally forgot to answer my question #4! It turns out that the data returned by both extraction methods is exactly the same. Even the MD5 checksums of the PCAP files match. Great!

Also, note that I edited the above to add details on the command line I used to generate the indices. Another stupid rush mistake on my part. Sorry!

Wednesday, April 02, 2008

Automating malware analysis with Truman

Let me start by saying, I'm no malware analyst. I've done a little reversing with IDA Pro, but really only in class. However, during an incident investigation, I frequently come across an unknown Windows binary that is likely to be some sort of malware, and would really, really like to know what it does, and stat! I can do a few basic tests on the binary myself, like examining the strings for clues as to it's purpose, or maybe unpacking it if it's using some standard packer. For the most part, though, modern malware is too hard to get a good handle on quickly unless you're aces with a debugger.

A couple of years ago, though, Joe Stewart from LURQH (now SecureWorks) released an analysis framework just for folks like me. The Reusable Unknown Malware Analysis Net (TRUMAN) is a sandnet environment that allows the malware to run on a real Windows system attached to a protected network with a bunch of fake services, then collects data about the program's network traffic, files and registry entries created, and even captures RAM dumps of the infected system. The great thing about TRUMAN is that it not only makes it easy to collect this information, it automates most of the process of creating a secure baseline, analyzing changes against that, and restoring the baseline to the Windows system when it's all over.

The terrible thing about Truman though, is that it is quite a complex system, with a lot of moving parts. Which would be OK, except that it comes with almost no documentation. Seriously. None. There's an INSTALL file that gives a brief overview, but leaves out most of the important steps. Frankly, except for one Shmoocon presentation video, there's nothing on the Internet that really tells you what Truman is, how it works or how to go about installing and using it.

Until now!

I just added a TRUMAN page to the NSMWiki. This contains a lot of information, much more than would comfortably fit into a blog post. Most importantly, it contains a detailed step-by-step process for getting TRUMAN up and running with RHEL5 and Windows XP.

Using Truman, I can collect a substantial amount of information about how a suspicious binary acts when it runs, and do it in a matter of 20 or 30 minutes, rather than hours. Admittedly, it's not foolproof, but it should come in extremely handy next time I run across an unknown Trojan.