Blog.HighSpeedWeb.net | Parsing the ridiculously large DNC file

Recently, one of our customers had a unique problem. The national Do Not Call (DNC) list, which they pay $15,000 per year to access (yes, that’s a comma, and yes, those are zeros after it), had grown past the 2 GB file-size limit of the older systems they were running. Specifically, FoxPro couldn’t handle files larger than 2 GB. So they asked if we could build something that would split the file into smaller files that the outdated system could consume.

So we did…

The DNC file is a list of phone numbers that telemarketers are required not to call. It is a text file with one number per line in the form 123,4567890, where the area code comes first, followed by a comma and the rest of the number. We decided to break the file up by the first digit of the area code, though the script can easily be modified to use the first two or three digits. It places the numbers in files named X.txt, where X is the digit we are separating by. So 1.txt contains all numbers whose area codes start with 1, and so on.
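To make the "first one, two, or three digits" idea concrete, here is a minimal sketch of how the file name could be derived from a line. The helper name bucket_key and its second parameter are assumptions made up for this illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first $digits digits of the area code,
# or nothing when the line is not in the 123,4567890 form.
sub bucket_key {
    my ($line, $digits) = @_;
    $digits = 1 unless defined $digits;        # default: first digit only
    return if $digits < 1 || $digits > 3;      # an area code has 3 digits
    return unless $line =~ /^(\d{3}),\d{7}\s*$/;
    return substr($1, 0, $digits);
}

print bucket_key("212,5551234\n"), "\n";       # 2
print bucket_key("212,5551234\n", 2), "\n";    # 21
print bucket_key("212,5551234\n", 3), "\n";    # 212
```

With a key of 2 or 3 digits the output files would be named 21.txt or 212.txt instead of 2.txt, which gives many more, much smaller files.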

I did this in Perl for several reasons:

Few languages handle text as well as Perl.
Perl was readily available and easy to whip this out in.
The irony of having Linux/Perl do what their Windows/FoxPro could not made me chuckle.

The file is an executable Perl script that accepts the name of the DNC file as its only argument. It opens that file for reading, partly because we need to know it actually exists and partly because we need to read from it. Then we open the first X.txt file we are going to spit numbers into.

As we loop through the lines of the DNC file, we check whether each line starts with the same digit as the file we have open for writing. If it does, we write the line to that file. If it does not, we close the current file and open a new one for the new digit. This works because the DNC file is in numerical order. The script prints a message each time it starts a new file, so we can keep tabs on progress. It also lets us know if it encountered any unrecognized lines.

So here it is:


#!/usr/bin/perl -w

use strict;

if(!@ARGV) { die "I need a file to parse!\n"; }

my $file = $ARGV[0];

open(my $fh, "<", $file) or die $!;

# Start with 1.txt; later files are opened as the first digit changes.
my $start = 1;
open(my $cfh, ">", $start.".txt") or die $!;
while(<$fh>) {
	# Capture the first digit of the area code, e.g. 123,4567890 gives 1
	if(/(\d)\d{2},\d{7}/) {
		# print "Found ".$1."\n";
		if ($1 == $start) {
			print $cfh $_;
		} else {
			$start = $1;
			print "Creating new number file ".$1."\n";
			close $cfh;
			open($cfh, ">", $1.".txt") or die $!;
			print $cfh $_;
		}
	} else {
		# $_ still ends with its own newline
		print "Unrecognized line: ".$_;
	}
}
close $cfh;
close $fh;

And that’s it! Copy that into a file, make it executable, and have at it. It tore through the DNC file in about 5 minutes, and for 200 million rows, that’s not bad.
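One caveat: the script relies on the DNC file being in numerical order, because each X.txt is opened with ">" and would be overwritten if its digit showed up again later. If the order could not be trusted, a variant along these lines might be safer. This is only a sketch under that assumption, not what we ran for the customer; the function name split_by_first_digit and its directory parameter are made up for the example. It keeps one output handle open per first digit (ten at most).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the lines read from $in into <digit>.txt files under $dir without
# assuming sorted input. Returns the number of unrecognized lines.
sub split_by_first_digit {
    my ($in, $dir) = @_;
    my %out;            # first digit => open output handle
    my $skipped = 0;
    while (my $line = <$in>) {
        if ($line =~ /^(\d)\d{2},\d{7}\s*$/) {
            my $digit = $1;
            unless ($out{$digit}) {
                # First time this digit is seen: create its file.
                open($out{$digit}, ">", "$dir/$digit.txt")
                    or die "Cannot open $dir/$digit.txt: $!";
            }
            print { $out{$digit} } $line;
        } else {
            $skipped++;
        }
    }
    for my $digit (keys %out) {
        close($out{$digit}) or die "Cannot close $dir/$digit.txt: $!";
    }
    return $skipped;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    open(my $fh, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my $skipped = split_by_first_digit($fh, ".");
    close $fh;
    print "Unrecognized lines: $skipped\n";
}
```

The trade-off is that progress messages per file no longer make sense, since all the files are written to at once.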

PS: Our customer liked this, and had us run it once. When they came back the next month and asked us to run it again and we saw it was going to be a monthly event, I sighed, and rewrote it in .NET so they could do it themselves. I will post that next.

2 thoughts on “Parsing the ridiculously large DNC file – Perl Edition”

Mohit Vohra says:

October 27, 2010 at 12:40 am

Hi Jay,

I was going through your script and liked it a lot. However, since I’m new to Perl, I didn’t understand one line in your script:

if (/(\d)\d{2},\d{7}/)

What are we trying to accomplish with this condition? Sorry if my question sounds basic; I am still learning the ropes, so I hope you don’t mind answering this.

Secondly, if they wanted a monthly job, couldn’t we set the Perl script to run as a Windows service? I have seen this in some forums but am not sure about it, so maybe you could have just set this script to run as a Windows service and they would not need to worry about it.

- Jay Ward says:
  
  October 27, 2010 at 10:12 am
  
  Mohit,
  
  It’s a regular expression that says “find a digit followed by 2 digits followed by a comma followed by 7 digits”. \d matches any digit 0-9, and we capture the first one with the parentheses so we can separate by that first digit.
  
  We could definitely have done some sort of service. On Windows, I would rather go with .NET, since Windows likes it better and we can do more things quickly and easily in Windows with it, which is why I did the Parsing the Ridiculously Large DNC File – .NET Edition.
  
  As for making it a service, that presents several other problems that I didn’t want to have to work out with them. Namely, the location and name of the file would need to be semi-consistent, and the file would need to be in said place at a fairly specific time, neither of which is very consistent in their industry. So it was easier just to hand them something they could control rather than trying to force it with a service. They liked it better that way too, so it worked out.
  