Thursday, July 13, 2006

Canonicalizing funky IP addresses

I've been playing around with some phishing data recently, to see if I can correlate known phishing attempts with my outgoing network sessions in order to alert me when my users fall victim. I don't have any results to report for this yet, but I did have to do some research into canonicalizing IP addresses.

Most people don't realize this, but there are a few different IP address formats. We're used to seeing the dotted decimal quad form (e.g. "") but there are others, and the phishers are using them. Thus, it behooves us to know how to work with them.

I was unable to find a comprehensive list of available formats online for some reason (if you know of one, please leave a comment!) so I examined my corpus of phishing URLs and found the following examples:

  2. 1127353292
  3. 0x43320bcc
  4. 192.0x43.9.3 (i.e., mixed decimal and hex quads)

Of course, phishers also use normal hostnames, giving us at least 5 different types of input to deal with. Most of our tools use the decimal dotted quad format, so we need to normalize these formats into something useful.
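To make the relationship between these forms concrete, here is a minimal sketch of the arithmetic involved: the integer and hex forms are just the four octets of the address packed into one 32-bit number. The helper name `int_to_quad` is my own invention for illustration and does no DNS lookups.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Hypothetical helper (name is illustrative): convert a 32-bit integer
# into a decimal dotted quad without touching DNS.
sub int_to_quad {
    my ($n) = @_;
    # Anything outside the 32-bit range cannot be an IPv4 address.
    return undef if $n < 0 || $n > 0xFFFFFFFF;
    # pack 'N' yields 4 bytes in network (big-endian) order, which is
    # the packed form that inet_ntoa expects.
    return inet_ntoa(pack('N', $n));
}

print int_to_quad(1127353292), "\n";          # integer form
print int_to_quad(hex('0x43320bcc')), "\n";   # hex form, converted first
```

Both lines print 67.50.11.204, since 0x43320bcc and 1127353292 are the same number: 0x43 is 67, 0x32 is 50, 0x0b is 11 and 0xcc is 204.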

Here's a Perl function that should do this for you. If you pass it any of the above formats (including the hostname), it will normalize the input and return a decimal dotted quad in string form, suitable for your favorite security tool. If it can't be converted (maybe it's a hostname that no longer resolves) the function returns undef instead.

I've wrapped this in a simple command-line tool and also incorporated it into other Perl scripts. I'm sure you'll find other uses for it as well. And hey, if you come across any other legal address formats, please let me know.

use Socket;

# Take any valid IP address or hostname and return the normalized IP address.
# Recognizes the following formats:
# 0x43320bcc
# 1127353292
# 192.0x1c.0x10.9 (ie, mixed decimal and hex quads)
# The function returns the string value of the IP address if successful, or
# undef if not.
# Warnings:
# 1) This will return only a single address, so if the hostname resolves
# to more than one address, you won't get them all.
# 2) It will generate a DNS query when looking up hostnames
sub normalize_funky_addrs {
    my ($host) = @_;

    # If this is a hex address (e.g., "0x01234FAc") then first convert it to
    # decimal form
    if ($host =~ m/^0x[a-f0-9]+$/i) {
        $host = hex($host);
    }

    # If the entire address wasn't hex, individual octets might be. If so,
    # find them and convert them. Split the address into octets, check and
    # convert hex in each octet as necessary, then paste them all back together
    # into one string and continue processing. Yes, I could just return here,
    # but I'd rather have only one function exit point.
    if ($host =~ m/^([^.]+)\.([^.]+)\.([^.]+)\.([^.]+)$/) {
        my ($octet1, $octet2, $octet3, $octet4) = ($1, $2, $3, $4);
        $octet1 = hex($octet1) if ($octet1 =~ m/^0x/i);
        $octet2 = hex($octet2) if ($octet2 =~ m/^0x/i);
        $octet3 = hex($octet3) if ($octet3 =~ m/^0x/i);
        $octet4 = hex($octet4) if ($octet4 =~ m/^0x/i);
        $host = "$octet1.$octet2.$octet3.$octet4";
    }

    # Now we've either got a hostname, a normal IP address or an integer-form
    # IP address. We can work with any of those.
    my $addr = gethostbyname($host);
    if ($addr) {
        return inet_ntoa($addr);
    } else {
        return undef;
    }
}
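If the DNS query mentioned in the warnings is a concern (say, when chewing through a large corpus of phishing URLs), a DNS-free variant is possible for the numeric formats. The following is only a sketch under my own assumptions, not a replacement for the function above: the name `normalize_numeric_addr` is hypothetical, it returns undef for hostnames instead of resolving them, and it rejects numbers with leading zeros rather than guessing how they should be read.

```perl
use strict;
use warnings;

# Hypothetical DNS-free variant: handles only the numeric formats and
# returns undef for hostnames or malformed input.
sub normalize_numeric_addr {
    my ($host) = @_;
    return undef unless defined $host;

    # Whole-address hex ("0x43320bcc") or plain integer ("1127353292").
    if ($host =~ m/^0x[0-9a-f]{1,8}\z/i || $host =~ m/^(?:0|[1-9]\d{0,9})\z/) {
        my $n = ($host =~ m/^0x/i) ? hex($host) : $host;
        return undef if $n > 0xFFFFFFFF;
        # Pack as a big-endian 32-bit value, then read back four bytes.
        return join '.', unpack('C4', pack('N', $n));
    }

    # Dotted quad whose octets may each be decimal or hex.
    my @octets = split /\./, $host, -1;
    return undef unless @octets == 4;
    for my $octet (@octets) {
        if ($octet =~ m/^0x[0-9a-f]{1,2}\z/i) {
            $octet = hex($octet);
        } elsif ($octet !~ m/^(?:0|[1-9]\d{0,2})\z/) {
            return undef;    # not a number (or has a leading zero)
        }
        return undef if $octet > 255;
    }
    return join '.', @octets;
}

print normalize_numeric_addr('0x43320bcc'), "\n";
print normalize_numeric_addr('1127353292'), "\n";
print normalize_numeric_addr('192.0x43.9.3'), "\n";
```

The first two calls print 67.50.11.204 and the third prints 192.67.9.3. One way to combine the two approaches would be to try the numeric path first and fall back to gethostbyname only when it returns undef.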
