We put data science to the test to try to uncover the mystery author of the Times’ Op-ed

Michael W. Kearney

The New York Times created a national whodunit when it published an anonymous opinion piece about the Resistance Inside the Trump White House. The author was identified only as “a senior official in the Trump administration.”

President Trump certainly wanted to know who it was: “If the GUTLESS anonymous person does indeed exist, the Times must, for National Security purposes, turn him/her over to government at once!” he tweeted.


Early speculation focused on the use of the word “lodestar,” an unusual term previously used by Vice President Mike Pence in a CBS interview. Others have argued that such a conspicuous use of a distinctive term is instead a form of intentional misdirection.

The knee-jerk reaction to look at communication patterns among senior-level White House officials makes some sense. But doing so without leveraging recent advances in data science would be a huge waste and, for me, a total bummer.

Ultimately, the question is an empirical one. Thanks to the wonders of digital media and data science, we can start to gain insights in a relatively short amount of time. And since a lot of the work I do involves analyzing political communication on Twitter anyway, I figured I’d give it a shot as a way to demonstrate what’s possible with some data science training, plenty of data, and a little bit of time.

Spoiler alert: I don’t have the name you want. But I took a fascinating trip in communications analysis. Here’s how:

1. I collected op-ed text

First, I grabbed the [reference] text from the actual New York Times op-ed.

2. I collected up to the most recent 3,200 tweets from each [Twitter-using] member of Trump’s cabinet

Because I needed some writing samples to compare against the text of the op-ed, I turned to Twitter. It seemed apropos given the communication preferences of the current administration, and I happen to maintain some open-source software for interacting with Twitter’s APIs. For the sake of time, I limited my analysis to members of the president’s Cabinet. Given the only description we have of the author, it seemed fairly safe to assume that “a senior official in the Trump administration” would, at the very least, accurately describe a member of Trump’s Cabinet.
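The 3,200 figure isn’t arbitrary: Twitter’s standard timeline API returns tweets in pages (up to 200 per request) and caps retrieval at roughly a user’s 3,200 most recent tweets. Here is a rough Python sketch of that paging logic. The `fetch_page` callable is a hypothetical stand-in for one real API request, not the actual code I used:

```python
from typing import Callable, List, Optional

def collect_recent_tweets(
    fetch_page: Callable[[Optional[int]], List[dict]],
    max_tweets: int = 3200,
) -> List[dict]:
    """Page backward through a timeline until we hit the API's cap.

    `fetch_page(max_id)` stands in for one timeline request: it returns
    up to 200 tweets with id <= max_id (or the newest page when max_id
    is None), each as a dict with at least an "id" key.
    """
    tweets: List[dict] = []
    max_id = None
    while len(tweets) < max_tweets:
        page = fetch_page(max_id)
        if not page:
            break  # timeline exhausted before reaching the cap
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # next page: strictly older tweets
    return tweets[:max_tweets]
```

Looping with a decreasing `max_id` (rather than an offset) is the usual pattern here, because new tweets arriving mid-collection would otherwise shift the pages underneath you.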

3. I split op-ed text into paragraphs

With a reference text and a number of samples from Twitter, I then split the op-ed text by paragraph to roughly match the length of tweets.
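Splitting on blank lines is all this step requires. A minimal sketch (the `oped_text` string here is a placeholder, not the actual article text):

```python
def split_paragraphs(text: str) -> list:
    """Split a document on blank lines into paragraph-sized chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

oped_text = "First paragraph here.\n\nSecond paragraph here.\n\nThird."
paragraphs = split_paragraphs(oped_text)
# each paragraph is now a roughly tweet-length unit of analysis
```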

4. I converted each text into 107 numeric features

Using this supply of texts, each matched to a Cabinet member’s screen name or to the “op-ed” author, I then extracted estimates for over 100 features (numeric representations of observed patterns in the text) for each string of text. Some examples of the features include capitalization, punctuation (commas, periods, exclamation points, etc.), use of white space, word length, sentence length, use of ‘to be’ verbs, and numerous thesaurus-like representations of word dimensions (similar to dividing commonly used words into eighty different topics and then measuring the extent to which each text used words from each topic).
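To make the idea concrete, here is a toy Python version of feature extraction: each text becomes a vector of numbers describing surface patterns like those listed above. This is an illustrative handful, not the 107 features used in the actual analysis:

```python
import string

TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}

def text_features(text: str) -> dict:
    """A handful of simple stylometric features for one text."""
    words = text.split()
    n_chars = max(len(text), 1)   # avoid dividing by zero on empty text
    n_words = max(len(words), 1)
    return {
        "n_chars": float(len(text)),
        "n_words": float(len(words)),
        "mean_word_len": sum(len(w) for w in words) / n_words,
        "prop_caps": sum(c.isupper() for c in text) / n_chars,
        "prop_punct": sum(c in string.punctuation for c in text) / n_chars,
        "prop_space": sum(c.isspace() for c in text) / n_chars,
        "n_commas": float(text.count(",")),
        "n_exclaims": float(text.count("!")),
        "to_be_verbs": float(
            sum(w.lower().strip(string.punctuation) in TO_BE for w in words)
        ),
    }
```

Run over every paragraph and every tweet, this turns each text into a point in feature space, which is what makes a numeric comparison between authors possible at all.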

5. I estimated the correlation between the op-ed texts and the tweets posted by each Cabinet member account

To get an actual measurement of similarity between the op-ed and each Twitter account, I averaged the numeric features by author, and then used those values to estimate the correlations (a measure of association ranging from -1 to 1) between the op-ed texts and each Twitter user’s texts.
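In other words, each author is reduced to one averaged feature vector, and vectors are compared with Pearson’s r. A minimal dependency-free sketch of that step (made-up vectors; the real analysis ran over all 107 features):

```python
from statistics import fmean

def average_features(vectors: list) -> list:
    """Column-wise average of one author's per-text feature vectors."""
    return [fmean(col) for col in zip(*vectors)]

def pearson(x: list, y: list) -> float:
    """Pearson correlation between two equal-length feature vectors."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sx * sy)

# Hypothetical usage: one averaged vector per author, compared to the op-ed.
oped_avg = average_features([[2.0, 0.1, 5.0], [4.0, 0.3, 7.0]])
author_avg = average_features([[3.0, 0.2, 6.0], [3.2, 0.2, 6.1]])
similarity = pearson(oped_avg, author_avg)
```

Ranking every Cabinet account by this number is what produces the chart below.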

You can find the code I used on GitHub. And here is the visual representation of the correlation coefficients:

Text similarity of NYT op-ed and Cabinet tweets

And the writer is …

Looking at the chart, it would appear that the vice president is the writer. Him, or Trump himself. Fat chance of that.

But don’t leap there.

This exercise is useful for illustrating how it’s possible to use data science to estimate the similarity between multiple texts, but it does not provide any conclusive evidence to answer the question of who authored The New York Times op-ed.

In fact, there are numerous reasons why you should be skeptical of inferences made from this analysis. A few:

  • The comparison texts are limited because they were written for Twitter, not the Times. They were also authored by users who presumably were OK with their identities being connected to their tweets (unlike the anonymous author behind the op-ed).
  • The pool of comparison texts doesn’t even represent all of the possible options of people who would fit the description of a senior official in the Trump administration. For instance, this omits from consideration any Cabinet member who doesn’t have or use Twitter. It also omits anyone who works in the Trump administration but who is not an official member of the Cabinet.
  • The text similarity analysis assumes the author of the anonymous op-ed didn’t try (or failed) to disguise their own communication patterns. Even if they did try to disguise themselves, it’s certainly possible that some communication patterns still slipped through the cracks, but it’s also possible they planted enough false leads (e.g., “lodestar”) that most algorithms would get it wrong.
  • It assumes the people who write the tweets are the people they purport to represent — Trump does not operate the @POTUS account, for instance. But in most cases, we never know for sure. It’s also entirely possible that some internal communications person influences the messages sent on behalf of several of those accounts. Or it’s possible that administrations tend to have overlapping communication patterns because they strive for monolithic messaging and, as a consequence, there are a handful of near matches in any sample of texts composed by people working in the White House.