Need for speed 2: Newspaper data diving, metrics and methodologies
Welcome to the weeds, fellow bit-twisters and data divers. We can chat here without worrying about the numeracy nonbelievers. This post details the methodologies used in “Need for speed 1: Newspaper load times give ‘slow news days’ new meaning.”
First, you and I both know “load time” is a fickle metric, completely dependent on the user’s connection speed and location. These screenshots of Pingdom Website Speed Test results show load times for The Washington Post: first, 3.76 seconds from Dallas; a minute later, 16.46 seconds from San Jose, just a few states away.
I hope the number of results in this report (1,455 newspapers) smooths out these variances enough to approach an overall average of actual load times. But maybe not. So, as with the Tools We Use reports, consider the data herein more generally informative than statistically precise.
The steps I took to improve the accuracy of the results included:
- Rerunning tests for 100-plus sites with load times greater than 50 seconds (more than two standard deviations above the mean), which returned more reasonable results for 95 percent of those sites.
- Visually inspecting all sites with load times under four seconds and deleting any that weren’t really news sites (loosely defined as regularly updated, linked lists of news articles and sections).
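The outlier screen in the first step can be sketched like this — a minimal illustration assuming the test results sit in a pandas DataFrame; the `load_time` column name and the toy data are mine, not the project’s actual pipeline:

```python
import pandas as pd

def flag_outliers(df: pd.DataFrame, col: str = "load_time") -> pd.Series:
    """Boolean mask of rows more than two standard deviations above the mean."""
    mean, std = df[col].mean(), df[col].std()
    return df[col] > mean + 2 * std

# Toy load times in seconds; the 90-second site exceeds the two-sigma cutoff
df = pd.DataFrame({"load_time": [3, 4, 5, 4, 3, 5, 4, 3, 4, 90]})
retest = df[flag_outliers(df)]  # these sites get rerun, not dropped
```

Note the flagged sites are retested rather than discarded — dropping them outright would bias the average downward.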
Perhaps I can divert you from my data deficiencies with some eye candy:
Note the nice funnels of correlation for requests and page weight (at the top). I expected PageSpeed scores, Alexa rank and other factors to also correlate well. None did.
So I grabbed a bunch of other data, including bounce rates, DOM elements and monthly visits. But requests and page weight remained the only factors I found in lock-step with load times:
|  | Correlation with load time | Correlation with speed index | Mean | Median |
| --- | --- | --- | --- | --- |
| Page weight (MB) | 0.683 | 0.461 | 4.7 MB | 3.9 MB |
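The correlation hunt itself is a one-liner once the results are in a DataFrame. A minimal sketch with made-up numbers — the column names are illustrative, not the project’s actual schema:

```python
import pandas as pd

# Toy results table; the real data came from the WebPagetest and PageSpeed APIs
df = pd.DataFrame({
    "load_time":   [3.1, 5.4, 8.2, 11.0, 14.6, 20.3],
    "page_weight": [1.2, 2.8, 4.1, 5.5, 7.9, 10.4],
    "requests":    [40, 85, 120, 150, 210, 260],
    "pagespeed":   [88, 72, 65, 70, 55, 61],
})

# Pearson correlation of every factor against load time
corr = df.corr()["load_time"].drop("load_time").sort_values(ascending=False)
print(corr)
```

In this toy data, requests and page weight track load time closely while the PageSpeed score runs the other way — the same shape the real results showed.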
I now have a sea of data I’m just starting to wade through. I’ll be looking at other load time correlations, like with CMS and servers. If you have suggestions, please comment below.
The newspaper data came mostly from API calls to:
- WebPagetest.org for load times, speed index, requests, page weight and DOM elements.
- Google PageSpeed Insights for desktop, mobile, and mobile UX scores.
- SimilarWeb for bounce rate, visits and pages-per-session.
- Alexa for the global site ranking.
- BuiltWith and W3Techs for CMS, server and widgets.
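A sketch of one such call, against WebPagetest’s classic JSON API: submit a test via `runtest.php`, then poll the result URL until the run finishes. The endpoints are real, but the exact result field names can vary by WebPagetest version, so treat the parsing as illustrative:

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

WPT = "https://www.webpagetest.org"

def parse_first_view(result: dict) -> dict:
    """Pull the metrics used in this report out of a WebPagetest JSON result.
    Field names follow the classic v1 JSON format and may differ by version."""
    fv = result["data"]["average"]["firstView"]
    return {
        "load_time_s": fv["loadTime"] / 1000.0,  # WPT reports milliseconds
        "speed_index": fv["SpeedIndex"],
        "page_weight_mb": fv["bytesIn"] / 1e6,
    }

def run_test(url: str, api_key: str) -> dict:
    """Submit a homepage test and poll until WebPagetest finishes it."""
    qs = urlencode({"url": url, "k": api_key, "f": "json"})
    submit = json.load(urlopen(f"{WPT}/runtest.php?{qs}"))
    while True:
        result = json.load(urlopen(submit["data"]["jsonUrl"]))
        if result.get("statusCode") == 200:  # 1xx codes mean still running
            return parse_first_view(result)
        time.sleep(10)
```

Looping `run_test` over 1,455 homepage URLs (with polite pauses) is the shape of the data collection, though the real runs also pulled requests and DOM counts.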
The “Tools We Use 1” methodologies section details how we compiled the list of newspapers and identified their CMS and servers. The URL tested for all results was the newspaper’s homepage. The WebPagetest settings used Chrome from Dulles, Virginia, simulating cable bandwidth.
In the top table of “Need for speed 1,” the “U.S. newspapers” averages are of all results for load times, requests and page weights (from WebPagetest.org), and of desktop scores (from Google PageSpeed Insights).
The averages of “All sites” for requests, page weights and desktop scores are from the HTTP Archive (September 2015). Average load time for all sites is a slippery statistic. To get close, I used the admittedly shaky method of determining the second at which Pingdom Tools switches from reporting “Your site is slower than X% of sites” to “Your site is faster than …” That happened at about 5.3 seconds.
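If that Pingdom wording flips at the 50th percentile — which is my assumption, not something Pingdom documents in the report — the crossover second is just the median of their corpus. With the raw distribution in hand, the same number would fall out directly:

```python
import statistics

# Hypothetical sample of load times (seconds) across many sites
times = [2.1, 3.4, 4.8, 5.2, 5.5, 6.7, 9.0]

# The time that beats exactly half the sites — where "slower than"
# would flip to "faster than" under the 50th-percentile assumption
crossover = statistics.median(times)
```
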
And noting way down here where no one will notice: “Widgets” isn’t really only widgets; it’s everything BuiltWith calls a technology, from a plugin for WordPress or jQuery to WordPress and jQuery themselves (so the total is proportional to the number of requests). I included this factor only because it was one of the few with a decent correlation to load time.
Thanks to BuiltWith for donating an account to RJI for this project. Thanks to Michael Jenner and Randy Picht for direction, and to Harsh Taneja for deviations (of the standard kind). The top image comes from the University of Missouri yearbook Savitar, 1956.
I’ll leave you with one last data dump:
| Load time (seconds) | Requests | Page weight (MB) |
| --- | --- | --- |