In January, in the wake of all the hubbub over earmarks, the US House of Representatives and Senate both enacted rules requiring that members publish earmark requests on their websites. The House deadline was over the weekend, but as reported today in Realtime Investigations, about a third of representatives blew off the rule. And, as predicted on a Sunlight blog post when the rules were enacted, the lack of clear rules about how the earmarks should be published makes it pretty tough to do any bulk analysis on the data.
That said, I caught the web-scraping bug at the PyCon OpenGov Hackathon (that’s me in the far corner, hunched over my laptop) and decided to take a crack at breaking out the data myself. With a few hours of work, I managed to produce consistently formatted data files for three districts. That’s only 1% of all the posted requests (according to Realtime), but it shows the job is doable. It might even be doable without knowing any programming, with some patient copy/paste/search/replace (a.k.a. “gymnastics with words”).
Here’s what I have so far:
- IL-9 (Schakowsky) – My district
- FL-10 (Young) – A random district chosen to experiment with extracting the text from PDF
- HI-1 (Abercrombie) – The first one on the list, just because I thought I’d do one more
- AL-4 (Aderholt) – A bonus district added after the original post. This is the first that includes rows with a source but no data, because some of his files had problems (see below)
If you decide to take a shot at this yourself, here are the columns I used. Staying consistent would make it easier to knit stuff together in the end:
- State (standard 2-char postal code)
- District (a number)
- Source (the URL from which the data came, in case it needs to be reviewed)
- Project name
- Requestor (the name of the organization requesting the earmark)
- Requestor Address
- Amount (leave out dollar signs and commas please)
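If you want to generate files in that shape programmatically, here’s a minimal Python sketch using the standard `csv` module. The column names come from the list above; the sample data is made up:

```python
import csv
import io

# Shared column order from the list above, so everyone's files knit together.
COLUMNS = ["State", "District", "Source", "Project name",
           "Requestor", "Requestor Address", "Amount"]

def write_rows(rows):
    """Write earmark rows to a CSV string using the shared column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # Strip dollar signs and commas so amounts stay machine-readable.
        amount = str(row.get("Amount", "")).replace("$", "").replace(",", "")
        writer.writerow({**row, "Amount": amount})
    return buf.getvalue()
```

The `DictWriter` handles quoting for you, which matters once project names start containing commas.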
I don’t know if the original rules (“post whatever you want, wherever you want”) were intended to discourage anything useful from being done with the data, or if they were just an oversight… but it would be pretty great to do something useful with the data despite them.
Update: Since a few folks tweeted about this post, I felt like I needed more to show, so I tackled the next guy on the list, AL-4’s Robert Aderholt, who posted his requests as 265 distinct PDF files! Well, 262, because some weren’t there, and two of those that were had just the word “funding” where the amount belonged. Anyway, the pdftotext tool made quick work of pulling the text out (something that won’t work so well with Hal Rogers’s inscrutable image PDF; for shame!). From there, the paragraphs looked entirely consistent to the casual eye, but actually varied in about 11 subtle ways in how the text was formatted. (Maybe more; toward the end I got a little loose with my regular expressions, which made it look like I was handling more files but left me with cleanup work to do as well…) Anyway, I added it to the list above. And as much as it’s fun when people notice, I suspect the long-run payoff isn’t there for me to slog through all of the rest. Hopefully, though, this will help us draw the line between transparency that follows the “letter of the law” and transparency that actually honors its spirit.
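For the curious, the extraction step looks roughly like this in Python. This is a sketch, not my exact code: the regex and the “Amount Requested” label are assumptions standing in for the real variations in the PDFs, and `pdftotext` (from the poppler tools) must be on your path:

```python
import re
import subprocess
from pathlib import Path

# Hypothetical pattern for the amount line; the real files varied in
# roughly a dozen subtle ways, so one regex like this won't catch everything.
AMOUNT_RE = re.compile(r"Amount(?:\s+Requested)?:?\s*\$?([\d,]+)", re.IGNORECASE)

def extract_amount(text):
    """Return the amount as a bare digit string, or None if absent
    (a couple of files just said 'funding' where the amount belonged)."""
    m = AMOUNT_RE.search(text)
    return m.group(1).replace(",", "") if m else None

def pdf_to_text(pdf_path):
    """Shell out to pdftotext and return the extracted text.
    Won't help with image-only PDFs, which have no text layer to extract."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    return txt_path.read_text()
```

Looping `pdf_to_text` over a directory of request PDFs and feeding each result to a handful of extractors like `extract_amount` is the whole pipeline; the slog is in handling each formatting subtlety as it turns up.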