Records request stalling? Scrape their site!

Steve Garrison knew he was being stonewalled.

Since June, Garrison had been waiting for an Indiana state agency to respond to his open records request. He knew the data existed and, if he could just get a full copy of it, he could finally answer a question that had bugged him for over a year.

As a court reporter for The Times of Northwest Indiana, Garrison encountered many cases involving domestic violence and other forms of threats and abuse. As a precautionary measure in such a case, a court may issue a protection order requiring the defendant to avoid contact or communication with the petitioner. That court order is then delivered to the defendant by a local law enforcement officer.

Except the last step didn't always happen, Garrison noticed. This shortcoming had real public safety consequences: a protection order must be properly served. Otherwise, it isn't legally enforceable, leaving potential victims vulnerable to further abuse.

So how often are protection orders properly served in Indiana? This was the question still bugging Garrison, who is now a master's student at the Missouri School of Journalism.

In order to answer it, Garrison requested a copy of the statewide Protection Order Registry. This database, maintained by the state Office of Judicial Administration, records every protection order issued by an Indiana court, including when (or if) it was served.

Portions of the registry are publicly accessible via a website. So after five months of waiting with no other options in sight, Garrison began systematically searching through this website, navigating to the details for each protection order, then copying and pasting key text from each web page into a spreadsheet.

In order to collect the data he needed, Garrison would have to repeat this process tens of thousands of times.

Teaming up to find a solution 

Kelly Kenoyer took note of Garrison's enormous endeavor and offered to help her fellow graduate student. She had recently learned Python and web scraping techniques while enrolled in the advanced data journalism class at the Missouri School of Journalism.

Web scraping can be a huge time saver for journalists who have little time to spare. Instead of pointing, clicking, copying and pasting the data you want into a spreadsheet, you can script out these instructions and let the computer do all the work.

With my guidance, Kenoyer wrote a Python program that:

  • cached search results from the protection order registry website;
  • parsed text out of the HTML;
  • validated the parsed text against our expected data structure; and finally,
  • wrote the results to a SQLite database, a format that Garrison knew how to query (a sketch of these last three steps appears below).
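
Here's a minimal sketch of what those last three steps might look like, assuming each cached page is saved as an HTML file on disk and using BeautifulSoup along with Python's built-in sqlite3 module (the selectors and column names are illustrative, not the registry's actual markup):

import sqlite3
from pathlib import Path

from bs4 import BeautifulSoup

connection = sqlite3.connect("protection_orders.db")
connection.execute(
    "CREATE TABLE IF NOT EXISTS orders (case_number TEXT, date_served TEXT)"
)

for page in sorted(Path("cache").glob("*.html")):
    soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")

    # Illustrative selectors; the real registry pages dictate what to look for.
    case_number = soup.select_one("#case-number")
    date_served = soup.select_one("#date-served")

    # A basic validation step: skip any page that doesn't match the expected structure.
    if case_number is None or date_served is None:
        continue

    connection.execute(
        "INSERT INTO orders VALUES (?, ?)",
        (case_number.get_text(strip=True), date_served.get_text(strip=True)),
    )

connection.commit()
connection.close()

Splitting the caching step from the parsing step means the slow network work only happens once; the parsing and validation can be rerun against the saved files as often as needed.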

We successfully automated all of Garrison's tedious work. The next step was to scale it up.

Given the terribly slow response time of the state agency's web server (30 seconds or longer for a single GET request), our scraper would need to run continuously for almost three weeks to collect the 60,000 highest priority records.

This estimate assumes we would execute the scraping job the same way most people learn to write and run Python programs: as a single, synchronous process wherein the code is executed one call at a time, in the order it appears in the main script.

If, however, we could run the same code in 10 parallel processes, we could cut down our time to about 48 hours.
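
(The back-of-the-envelope math: 60,000 requests at roughly 30 seconds apiece is about 1.8 million seconds of waiting, or roughly 500 hours, which is just shy of three weeks. Split that work across 10 processes and you're down to around two days.)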

That's exactly what we did, and—after learning more about multiprocessing in Python—it was easy.

The “hack” way

To run a web scraping program written in Python, typically you open your terminal emulator and invoke the Python interpreter along with the name of your main script file.

The command looks something like this:

python scrape.py

Say you wanted to cut your scraping time in half. You might get clever and open a second terminal window and invoke the same command. Now you've doubled the number of processes you have running, but your scraper won't finish any sooner. You still need to distribute the work of the scraper across these parallel processes.

When scraping a website, most of your program's running time is spent waiting for responses and downloading content from URLs that follow a particular pattern defined by the site's web developers. In the case of Indiana's protection order registry, this pattern is https://mycourts.in.gov/PORP/Search/Detail?ID={i}. The {i} is a placeholder for the unique identifier of each protection order.

So why not divide this exact task between the two parallel processes? We can do so by modifying scrape.py to get all the protection orders between two identifiers, specified via command line arguments:

python scrape.py --start=1 --end=30000

Now we can invoke the same command in separate terminal windows without duplicating our efforts.
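
Inside scrape.py, the argument handling might look something like this (a sketch using the standard library's argparse module; a fuller picture of the fetching code appears in the next section):

import argparse

parser = argparse.ArgumentParser(description="Cache protection order detail pages.")
parser.add_argument("--start", type=int, required=True)
parser.add_argument("--end", type=int, required=True)
args = parser.parse_args()

# Each invocation works through only its assigned slice of identifiers,
# so two terminal windows can run non-overlapping ranges at the same time.
for i in range(args.start, args.end + 1):
    url = f"https://mycourts.in.gov/PORP/Search/Detail?ID={i}"
    print(url)  # fetch and cache the page here instead of just printing its URL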

The “Pythonic” way

The solution outlined above works, but it's clunky. Unless you're adept at shell scripting, you have to manually start each process while avoiding any overlap in the specified ranges of identifiers.

It would be safer and more convenient if we could start our web scraper, specify the number of parallel processes we need, and let our Python code do all the rest, like this:

python scrape.py --num_processes=10

Python's multiprocessing module—part of the standard library since 2008—makes this possible.

Here's a simplified sketch of what the internals of scrape.py look like before we add parallel processing (function and file names are illustrative rather than taken from the actual script):
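
import os

import requests

URL_TEMPLATE = "https://mycourts.in.gov/PORP/Search/Detail?ID={i}"

# One Session is created up front and reused for every request.
session = requests.Session()

def cache_page(i):
    # Download a single protection order detail page and save the raw HTML to disk.
    response = session.get(URL_TEMPLATE.format(i=i))
    with open(f"cache/{i}.html", "w", encoding="utf-8") as outfile:
        outfile.write(response.text)

def main():
    os.makedirs("cache", exist_ok=True)
    # Work through every identifier, one slow request at a time.
    for i in range(1, 60001):
        cache_page(i)

if __name__ == "__main__":
    main()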

And here is a sketch of what it looks like after we add parallel processing:
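
import os

import requests
from multiprocessing import Pool

URL_TEMPLATE = "https://mycourts.in.gov/PORP/Search/Detail?ID={i}"

# Each worker process creates its own Session via the Pool's initializer.
session = None

def init_session():
    global session
    session = requests.Session()

def cache_page(i):
    # Download a single protection order detail page and save the raw HTML to disk.
    response = session.get(URL_TEMPLATE.format(i=i))
    with open(f"cache/{i}.html", "w", encoding="utf-8") as outfile:
        outfile.write(response.text)

def main():
    os.makedirs("cache", exist_ok=True)
    identifiers = range(1, 60001)
    # Spread the identifiers across 10 worker processes.
    # 10 is hardcoded here for brevity; in practice it would come from the
    # --num_processes command line argument shown above.
    with Pool(processes=10, initializer=init_session) as pool:
        pool.map(cache_page, identifiers)

if __name__ == "__main__":
    main()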

The key differences are:

  • From the multiprocessing module, we import the Pool class. A Pool controls the subprocesses that can handle jobs in parallel.
  • Instead of initializing the Requests Session immediately at the top of the script, we define a function for setting it up. Because each parallel process runs in its own memory space, they can't share the same Session object. Rather, each process needs to initialize its own, which we specify via the initializer argument when we set up our Pool.
  • By default, the number of worker processes available to Pool will equal the number of central processing units (CPUs) on your machine. You can change this via the processes argument when you initialize the Pool.
  • Finally, we use the .map() method to apply the cache_page function to each item in our list of identifiers.

The results

With the aid of our speedy web scraper, Garrison was able to expand the scope of his investigation across more Indiana counties and a longer time period.

Even as we continued collecting data, Garrison shared preliminary findings with the Indiana Office of Judicial Administration. They were annoyed to learn that their website was being scraped. A few days later, the agency responded to his original records request for a complete snapshot of the protection order registry.

“Using the web scraper, we were able to break through the government's intransigence and obtain important information about problems serving protection order paperwork,” Garrison said.

We helped him get the data that formed the basis of his investigation, and along the way we learned a helpful web scraping trick for our future projects. After reporting out his findings, Garrison plans to pitch his story to the Indianapolis Star.
