AP’s DataKit: Helping teams collaborate efficiently
Data projects have many moving parts, and coordinating all the pieces can be challenging even for one enterprising journalist, let alone an entire data team simultaneously pushing on multiple fronts.
To help bring sanity to their workflow, the Associated Press developed DataKit, a simple command line tool for setting up and syncing data analysis projects. After using the tool internally for nearly two years, the AP publicly released DataKit in September 2019.
AP Data Editor Meghan Hoyer gave a tour of DataKit in a recent webinar. She said the tool has helped her team collaborate more efficiently, especially on big projects like their work covering opioid distribution trends. But DataKit isn't just for big projects: every data analysis at the AP is initiated and managed with it.
“It clears the way for you to do your best work in terms of analysis and digging into data,” Hoyer said.
Hoyer presented DataKit along with Serdar Tumgoren, DataKit's lead developer. Tumgoren previously worked with Hoyer at the AP. He currently teaches data journalism at Stanford University.
Given the recent rollout, I decided to give DataKit a whirl. It's free, easy to install and, in my opinion, a great tool with a lot of potential.
DataKit is essentially a wrapper around other command line utilities you may already know. Under the hood, DataKit relies on Cookiecutter, a project templating tool that helps automate the process of setting up projects in any programming language. Data journalists at the AP tend to work in R, but DataKit would be just as suitable for those who prefer Python.
Project templates bring more consistency across projects and less fussing with annoying minutiae (folder structures, file names, locations of remote repositories), any of which can trip you up when dipping into someone else's code.
For a typical data analysis project, the code lives in one place (probably a GitHub repo) while the actual data live in another (probably a bucket on Amazon's Simple Storage Service).
Consequently, setting up and sharing a project requires interacting with at least two separate systems (maybe more), each with distinct controls and configurations. As a result, precious time needed to hone your analysis is instead spent navigating inscrutable command line interfaces.
DataKit aims to reduce the workflow outside of analysis to the most essential steps and collect them under a single entry point.
You can spin up a new project with one simple command:
datakit project create
After ushering you through a few prompts prescribed by your project template, DataKit drops you into your new project's directory, ready to start coding.
You can then link your new project to a pre-existing S3 bucket shared by your team:
datakit data init
Then you can push your project's data to that bucket:
datakit data push
Then (after cloning your project to their own machines) your colleagues can pull that data down:
datakit data pull
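If you find yourself scripting these steps, the sync commands can be wrapped in a few lines of Python. What follows is only a rough sketch, not part of DataKit: the sync function, its dry_run flag and the idea of running each command from the project's root directory are my own assumptions. The only things taken from the tool are the three subcommands quoted above.

```python
import shlex
import subprocess
from pathlib import Path

# The three data subcommands quoted in this article.
# Nothing else about the DataKit CLI is assumed here.
SYNC_COMMANDS = {
    "init": ["datakit", "data", "init"],
    "push": ["datakit", "data", "push"],
    "pull": ["datakit", "data", "pull"],
}


def sync(action, project_dir, dry_run=True):
    """Print (or run) one data-sync command from inside project_dir.

    Hypothetical helper: assumes the command should be run from the
    project's root directory.
    """
    if action not in SYNC_COMMANDS:
        raise ValueError(f"unknown action: {action!r}")
    command = SYNC_COMMANDS[action]
    if dry_run:
        # Show what would be run without touching any remote storage.
        print(f"(in {project_dir}) {shlex.join(command)}")
    else:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(command, cwd=Path(project_dir), check=True)
    return command
```

Calling sync("pull", "my-project") only prints the command it would run. Passing dry_run=False actually runs it, which requires DataKit to be installed and the project to be linked to a bucket.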
The AP does great work, but maybe their process doesn't exactly fit how you do data journalism. Luckily, DataKit doesn't tie you down to the AP's way of doing things.
Say, for instance, you don't like the AP's scaffolding for R projects. You could fork their template, make a few tweaks, and use your version of their template in your projects.
You could also create your own template from scratch. If you go that route, check out Cookiecutter's documentation, in particular the section on pre/post-generate hooks, which allow you to further automate your project setup process. For instance, the AP's Python project cookiecutter will automatically set up a virtual environment (via Pipenv) and install pandas, ipykernel and other Python packages typically used in data analysis projects.
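To give a sense of what a post-generate hook looks like, here is a minimal sketch. Cookiecutter runs a template's hooks/post_gen_project.py from inside the freshly generated project, so the script can finish the setup with ordinary Python. The folder names below are hypothetical and are not taken from the AP's templates. Installing packages, as the AP's hook does, is left out to keep the example free of side effects beyond creating folders.

```python
# hooks/post_gen_project.py
# Sketch of a Cookiecutter post-generate hook. The folder layout is a
# made-up example, not the AP's actual scaffolding.
from pathlib import Path

FOLDERS = ["data/raw", "data/processed", "notebooks", "outputs"]


def scaffold(project_root="."):
    """Create the standard folders and return their paths."""
    created = []
    for folder in FOLDERS:
        path = Path(project_root) / folder
        path.mkdir(parents=True, exist_ok=True)
        # An empty placeholder file lets git track the otherwise empty folder.
        (path / ".gitkeep").touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Cookiecutter runs the hook with the new project as the working directory.
    scaffold()
```

Because the hook is plain Python, the same script could also write a starter README or initialize a git repository, depending on what your team repeats for every project.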
You can further customize DataKit using plugins. For example, the AP stores their code on GitLab so, as you might expect, they've developed a plugin for integrating DataKit projects with GitLab repos. However, since a lot of organizations prefer GitHub, the AP also developed a GitHub plugin for DataKit.
With a little extra effort, you can also develop and release your own plugin, and this is where I was struck by the tool's true potential. DataKit isn't a monolithic application. Rather, it's a framework that can be easily extended to cover a wide variety of use cases.
Case in point: data and interactive journalist Marc Lajoie, most recently at the CBC/Radio-Canada, liked what he saw in the AP's webinar and decided to build on their work. Within a few days, he rolled out two additional DataKit plugins:
- datakit-data-gdrive for folks who want to sync their project data to Google Drive
- datakit-bitbucket for integrating with yet another source control management system.
“It really is enormously flexible, and pretty straightforward to adapt to any technologies/services that have Python bindings,” Lajoie said.
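To make the "framework, not monolith" idea concrete, here is a toy illustration of how a command line tool can hand commands off to plugins. This is not DataKit's actual plugin API, which I won't try to reproduce here. The REGISTRY dictionary, the command decorator and the dispatch function are all hypothetical names invented for this sketch.

```python
from typing import Callable, Dict, List

# Maps a command name such as "data push" to the function that handles it.
REGISTRY: Dict[str, Callable[[List[str]], str]] = {}


def command(name: str):
    """Decorator a plugin would use to register a handler under a name."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


# A made-up plugin: it only reports what it would do.
@command("data push")
def push_to_backend(args: List[str]) -> str:
    return f"pushing {len(args)} file(s) to a hypothetical storage backend"


def dispatch(argv: List[str]) -> str:
    """Find the longest registered command name at the start of argv."""
    for size in range(len(argv), 0, -1):
        name = " ".join(argv[:size])
        if name in REGISTRY:
            # Everything after the command name is passed to the handler.
            return REGISTRY[name](argv[size:])
    raise ValueError(f"unknown command: {' '.join(argv)}")
```

The core tool only needs to know how to look up and call handlers. Each new storage or source control backend is then a separate package that registers its own commands, which is roughly the kind of flexibility Lajoie describes.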
It's open source
Hoyer said her team uses DataKit every day. For that reason, the AP will continue to support and enhance the tool.
“It's not like [DataKit] is going to fade away in the next couple of months,” Hoyer said.
Further development will happen in the open with input from users in other news organizations. The source code for DataKit, as well as many of its plugins, is publicly available on GitHub, where anyone can submit suggestions and even pitch in on the development. The AP has also set up the #proj-ap-datakit channel on the News Nerdery Slack, where journalists who might benefit from the tool are likely to hang out.
Why projects like these are so important
“Collaboration” is a buzzword we've been hearing and repeating non-stop for years now in professional circles. Projects like DataKit are where the rubber meets the road.
Will we define the tools that facilitate our best work? Will we do it in a way that enables others to adapt those tools to meet their particular needs? The answer put forth by DataKit is a resounding “yes, we will.”