Why The New York Times is Working With Matter

For years I’ve followed the progress of Matter Ventures, the San Francisco-based media accelerator run by Corey Ford. I can no longer remember exactly how I was introduced to Matter and to Corey, but I do remember the first demo day that I attended a few years ago, at an event space associated with WNYC. Somehow I had gotten an invitation, but I was on the fence about whether to go. I had recently started in a new role, and was feeling pressed for time. In the end, I went, mostly because an engineering lead I was trying to hire was likely to be there and I was hoping to stalk him in his natural habitat.

When I got there, I recognized a face, then another, and another, and I realized I was walking into a room full of many of the most talented people in digital media in New York, with all sorts of opportunities for stalking talent! And this from an accelerator that was based in San Francisco. When the demos started, and the ideas began to flow, I was sure there was something special going on.

Over the next several years I got to know Corey and his program better. I was more and more impressed. The rigorous application of design thinking, the selectiveness applied to the participating startups, the quality of the ideas and the people, the energy surrounding the whole process, all supported my initial reaction.

I also loved the enthusiasm and optimism around the potential of digital media. Despite the very real disruption of the industry, Matter clearly believes in the potential of media to reinvent itself, evolve and thrive. I do too.

However, it was never practical for my New York-based organization to actually work with Matter because they ran their program in San Francisco. While I understood the obvious appeal of the Bay Area, I thought it was a shame for Matter’s presence in New York to be limited to the demo day — a missed opportunity for both Matter and the city that is the undisputed media capital of the world.

So I couldn’t be more pleased that Matter is finally launching a class in New York — with the participation of The New York Times and the support of the Google News Lab. I’m so happy to be able to offer the team here at The Times the opportunity to work alongside the inaugural Matter NYC class. And as a veteran of the first wave of the Silicon Alley startup scene back in the ’90s, I am thrilled to play a small role in injecting this wonderful ingredient into what is now an incredibly vibrant tech/media/startup scene in New York.

Read more...

Our Tagged Ingredients Data is Now on GitHub

Since publishing our post about “Extracting Structured Data From Recipes Using Conditional Random Fields,” we’ve received a tremendous number of requests to release the data and our code. Today, we’re excited to release the roughly 180,000 labeled ingredient phrases that we used to train our machine learning model.

You can find the data and code in the ingredient-phrase-tagger GitHub repo. Instructions are in the README and the raw data is in nyt-ingredients-snapshot-2015.csv.

There are some things to be aware of before using this data:

  1. The ingredient phrases were manually annotated by people hired by The New York Times; their efforts were instrumental to the success of our model.
  2. The data can be inconsistent and incomplete, but what it lacks in quality, it makes up for in quantity.
  3. Not every word has a tag, and some words carry multiple tags.
  4. We have spent little time optimizing the conditional random fields (CRF) features and settings because the initial results met our accuracy needs. We would love to receive pull requests that improve accuracy further.

Examples

| INPUT | NAME | QUANTITY | UNIT | COMMENT |
| --- | --- | --- | --- | --- |
| 1 6-inch white-corn tortilla | white-corn tortilla | 1.0 | | 6-inch |
| 3 cups seedless grapes, equal amounts of red and green grapes | grapes | 3.0 | cup | seedless, equal amounts of red and green |
| 1/4 cup good quality olive oil | good quality olive oil | 0.25 | cup | |
| 3 large cloves garlic, smashed | garlic | 3.0 | clove | smashed |
| Rind from 1/2 pound salt pork | salt pork | 0.5 | pound | Rind from |
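
The labeling behind these examples can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual preprocessing code: it assigns at most one tag per token and ignores the untagged and multiply tagged cases noted above.

```python
def label_tokens(phrase, parsed):
    """Tag each token of an ingredient phrase with the structured field
    (NAME, QUANTITY, UNIT, COMMENT) it belongs to, or OTHER if none match."""
    tokens = phrase.replace(",", " ,").split()
    labels = []
    for token in tokens:
        tag = "OTHER"
        for field, value in parsed.items():
            if value and token in str(value).replace(",", " ,").split():
                tag = field.upper()
                break
        labels.append((token, tag))
    return labels

pairs = label_tokens(
    "3 cups seedless grapes",
    {"quantity": "3", "unit": "cups", "name": "grapes", "comment": "seedless"},
)
```

Sequences of (token, tag) pairs like these are the general shape of what a CRF model trains on.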

Learning and Exploring on 100% Day

When you hear "hackathon," you might envision a scene from "The Social Network": bleary-eyed developers working into the early morning, slamming energy drinks and furiously typing away at keyboards, all in a bid to best the competition and be showered in glory by their peers. When I joined The New York Times two years ago, I assumed my first 100% Day would be similar; nothing could have been further from the truth.

The Times periodically hosts an internal 100% Day. A typical 100% Day, or "Maker Day," at The Times fosters a spirit of collaboration and personal development. We use the time to better ourselves, to learn something new or to build something we're interested in. It's a day when we can dig in, hang out with other members of the organization and just learn. At the end of it, we share whatever knowledge we've gained, or whatever we've built, with the rest of the company.

The March version of 100% Day was no different. Inspired by the work of Lara Hogan, I spent most of my day investigating ways to boost the speed of our current site. I dug into a tool called vmprobe and found ways to optimize our autoscaling efforts on the Real Estate section.

There were dozens of talks, ranging from research findings to full-blown demos. Here are some that I personally found interesting:

NYT Reactions: Jared McDonald, Jeff Sisson and Angel Santiago, from Technology, built a system to allow emoji reactions to articles. The goal was to create a mechanism for user feedback that is more at home in a mobile setting, and that could attract readers who would not otherwise be comfortable composing a full-fledged comment. The system was designed to be flexible enough to accept a range of emotional reactions, so anything from a "Recommend" to an expressive emoji is possible.

Lunchbot: Chris Ladd, from Digital Design, built a Slack bot to respond to the question: “What’s on the menu in the cafe on the 14th floor?” Chris demonstrated his code live on stage, and integrated the bot into several NYT Slack channels. He has also graciously open sourced his code. If you’re so inclined, you can check out the code on GitHub.

Read more...

Improving Startup Time in the NYTimes Android App

Improving application startup and load time has been a priority for The New York Times Android development team, and we’re not alone. As device manufacturers continue to offer faster and more fluid experiences, users expect their native apps to be faster still.

Our team recently rewrote our news app to take advantage of modern patterns such as dependency injection and reactive programming. The rewrite brought easier maintenance, a modernized codebase and better modularization, but it required some tuning to perform well.

When we initially released our new app, which we nicknamed Phoenix, startup time was 5.6 seconds on a Nexus 5, far longer than our goal of 2 seconds or less. That gap motivated us to put real effort into improving performance.

We found that most of the slowdown was caused by issues with reflection. After addressing these and fixing other smaller things, we’ve reduced our current startup time to 1.6 seconds.

How We Did It

First, we captured our app's startup time with Android's method tracing feature, measuring from the Application class constructor to the point where the progress indicator appears on screen (see the documentation).

We then loaded the resulting trace files into DDMS for analysis and tracked down the largest performance offenders. We eventually switched to NimbleDroid, which also offered a simple way to identify bottlenecks and made it easier to compare performance across traces.

The Low Hanging Fruit

The first major slowdown we found came from Groovy: the large number of classes it loads, its memory-intensive runtime and its expensive calls to load jar resources, a cost previously identified in other libraries such as Joda-Time. We primarily used Groovy for closures, and improvements in code folding within Android Studio had removed that need. Since we didn't use any other constructs of the language, we decided to move away from it, reverting to plain old Java 7 syntax and stripping Groovy from our codebase. We're still exploring other options, but enhancements in IDE support for viewing anonymous classes in Java have made that less of a priority.

Read more...

Flash-Free Video in 2016

At the beginning of this year, we officially turned off Flash support for VHS, the New York Times video player. We now use HTML5 video technology for all video playback on desktop and mobile web browsers. Flash was a very powerful and popular technology in its day but it has waned over the years as browsers have embraced the open standard of HTML5. Throughout the second half of 2015, Chrome, Firefox and Safari also began blocking the Flash plugin from automatically loading content unless users gave their permission. In order to continue providing a quality video experience for our viewers, switching exclusively to HTML5 video became necessary.

Background

Video has become a critical part of storytelling at The New York Times, and it has seen tremendous growth over the past few years. With this growth, as well as a new company-wide focus on video, we decided to build our own video player last year rather than continue using third-party solutions. There were many reasons for the move, but primarily we wanted to own the entire video experience in-house in order to focus on our needs and values. Performance, premium video quality and the ability to customize the experience were our primary concerns. With video used both in articles and in unique, custom experiences, we have a diverse set of requirements. We wanted to give the newsroom a flexible, extensible player with a solid core foundation so they could focus on creating the experience. With this aim, we built a JavaScript wrapper around a video element that could be either an HTML5 video element or a Flash OSMF video object. For extensibility, we included a light, event-driven plugin system.

Why Flash?

When we first developed the player, Flash was still the dominant technology for publishers. This was largely because most of the ad inventory was still in Flash. VPAID ads were almost exclusively created with Flash, and advertisers preferred to deliver their ads in a VPAID wrapper so they could add their own tracking.

Another reason we needed Flash was to play legacy content. Most of our older content (stretching back to 2006 on our earlier website) was encoded with On2 VP6, which requires Flash.

Read more...

How to Build a TimesMachine

At the beginning of this year, we quietly expanded TimesMachine, our virtual microfilm reader, to include every issue of The New York Times published between 1981 and 2002. Prior to this expansion, TimesMachine contained every issue published between 1851 and 1980: over 11 million articles spread across approximately 2.5 million pages. The expansion adds 8,035 complete issues containing 1.4 million articles across 1.6 million pages.

Creating and expanding TimesMachine presented us with several interesting technical challenges, and in this post we’ll describe how we tackled two. First, we’ll discuss the fundamental challenge with TimesMachine: efficiently providing a user with a scan of an entire day’s newspaper without requiring the download of hundreds of megabytes of data. Then, we’ll discuss a fascinating string matching problem we had to solve in order to include articles published after 1980 in TimesMachine.

The Archive, Pre-TimesMachine

Before TimesMachine was launched in 2014, articles from the archive were searchable and available to subscribers only as PDF documents. While the archive was accessible, two major problems in implementation remained: context and user experience.

Isolating an article from the surrounding content removes the context in which it was published. A modern reader might discover that on July 20, 1969, a man named John Fairfax became the first person to row across the Atlantic Ocean. However, a reader absorbed in The New York Times that morning might have been considerably more impressed by the front-page news that Apollo 11, whose crew included Neil Armstrong, had just swung into orbit around the moon in preparation for the first moon landing. Knowing where the John Fairfax article appeared in the paper (bottom left of the front page), as well as what else was going on that day, is far more interesting and valuable to a historian than the article on its own, stripped of the other news of the day.

We wanted to present the archive in all its glory as it was meant to be consumed on the day it was printed — one issue at a time. Our goal was to create a fluid viewing experience, not to force users to slowly download high resolution images. Here’s how we did that.

Read more...

Introducing Gizmo

At The New York Times, our development teams have been adopting the Go programming language over the last three years to build better back-end services. In the past I’ve written about using Go for Elastic MapReduce streaming. I’ve also talked about using Go at GothamGo for news analysis and to improve our email and alert systems at the Golang NYC Meetup. We use Go for a wide variety of tasks, but the most common use throughout the company is for building JSON APIs.

When we first began building APIs with Go, we didn’t use any frameworks or shared interfaces. This meant that they varied from team to team and project to project with regard to structure, naming conventions and third-party tools. As we started building more and more APIs, the pains of microservices started to become apparent.

Around the time we reached this point, I came across Peter Bourgon’s talk from FOSDEM 2015, “Go and the Modern Enterprise,” and its accompanying blog post. A lot of what Peter said hit close to home for me and seemed very relevant to our situation at The Times. His description of the “Modern Enterprise” fit our technology teams quite well. We’re a consumer-focused company whose engineering group has more than doubled in size to a few hundred heads in recent years, and we have had a service-oriented architecture for a long time. We have also run into the same problems he brought up. As the need for more and more microservices arose, we needed a common set of tools to orchestrate and monitor them, as well as a common set of patterns and interfaces that enable developers to concentrate on business problems. An RFC for a new toolkit named “Go Kit” came out of the talk, and eventually open source development of it was under way.

Peter’s talk and the concept of Go Kit made me very excited but also somewhat dismayed. I knew a lot of the RPC and tracing technology involved would likely take a long time to be adopted throughout the company without some stepping stones to get us there. We also didn’t have a whole lot of time to wait around for the toolkit to be completed, so we decided to build our own set of tools that could bridge the gap and hopefully complement Go Kit by implementing some of its “non-goals.”

Read more...

Using Go and Python NLTK for News Analysis

I had the opportunity to speak at this year’s GothamGo conference in New York City about a side project I’ve been working on at The New York Times for several years now: Newshound.

Newshound is a breaking news email aggregator that originated at one of the company’s developer events and ended up winning an internal contest in early 2013. Since then, I’ve used the platform as a guinea pig for trying out new technologies. In the latest iteration, I rewrote the core of Newshound with the Go programming language but left an essential piece of software in its original Python implementation. My talk covers the obstacles I had to overcome to complete this recent rewrite.

IoT and Cassandra: Topic Wildcards in Retained Storage

I recently presented at the 2015 Cassandra Summit on the Internet of Things (IoT) and our work at The New York Times, drawing on over fifty years of experience in computing.

The IoT contains Things of Interest (ToI) to our users and to The New York Times. While the intersection contains objects we are familiar with now (such as mobile devices, watches and laptops), by 2020 the mutual ToI will vastly expand. How should we respond?

One way, which I discuss in my presentation, is to generalize and expand our global services at a reasonable cost. This will allow our very creative marketers and developers to deploy great apps quickly and solidify each user’s experience across their ToI.

Building the Next New York Times Recommendation Engine

The New York Times publishes over 300 articles, blog posts and interactive stories a day.

Refining the path our readers take through this content — personalizing the placement of articles on our apps and website — can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format.

In this post, I'll discuss our recent work revamping The New York Times's article recommendation algorithm, which currently powers the Recommended for You section of NYTimes.com.

History

Content-based filtering

News recommendations must perform well on fresh content: breaking news that hasn’t been viewed by many readers yet. Thus, the article data available at publishing time can be useful: the topics, author, desk and associated keyword tags of each article.

Our first recommendation engine used these keyword tags to make recommendations. Using tags for articles and a user’s 30-day reading history, the algorithm recommends articles similar to those that have already been read.

Because this technique relies on a content model, it’s part of a broader class of content-based recommendation algorithms.

The approach has intuitive appeal: If a user read ten articles tagged with the word “Clinton,” they would probably like future “Clinton”-tagged articles. And this technique performs as well on fresh content as it does on older content, since it relies on data available at the time of publishing.

However, this method relies on a content model that sometimes has unintended effects. Because the algorithm weights tags by their rareness within the corpus, rare tags have an outsized influence. This works well most of the time but occasionally degrades the user experience. For instance, one reader noted that although she was interested in the pieces about same-sex marriage that occasionally appear in the Weddings section, she was being recommended wedding coverage about heterosexual couples. A low-frequency tag, "Weddings and Engagements," appeared on an article she had previously clicked, and it outweighed all the other tags that might have been more applicable to her.
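
The mechanics of that failure mode are easy to demonstrate. Here is a minimal sketch of inverse-frequency tag weighting; the formula and the toy corpus are illustrative assumptions, not the production algorithm:

```python
import math
from collections import Counter

def tag_weights(corpus_tags):
    """Weight each tag by its rareness across the corpus (inverse document frequency)."""
    n_docs = len(corpus_tags)
    doc_freq = Counter(tag for tags in corpus_tags for tag in set(tags))
    return {tag: math.log(n_docs / count) for tag, count in doc_freq.items()}

def score(article_tags, history_tags, weights):
    """Score an article by its weighted tag overlap with a user's reading history."""
    return sum(weights.get(tag, 0.0) for tag in set(article_tags) & set(history_tags))

# Toy corpus: "Politics" is common, "Weddings and Engagements" is rare.
corpus = [["Politics"]] * 50 + [["Weddings and Engagements"]] * 2 + [["Travel"]] * 48
weights = tag_weights(corpus)
history = ["Politics", "Weddings and Engagements"]
```

Because log(100/2) is more than five times log(100/50), an article carrying only the rare Weddings tag outscores an article carrying only common tags, which is exactly the behavior the reader observed.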

Collaborative Filtering

To address the shortcomings of the previous method, we tested collaborative filtering. Collaborative filters surface articles based on what similar readers have read; in our case, similarity was determined by reading history.
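
As a sketch of the idea, assuming similarity is measured as plain overlap (Jaccard) between sets of articles read; the production system is more elaborate, but the shape is the same:

```python
from collections import Counter

def jaccard(a, b):
    """Similarity of two reading histories: shared articles over total distinct articles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user_history, other_histories, top_n=3):
    """Surface articles the user hasn't read, weighted by how similar
    each other reader's history is to the user's."""
    scores = Counter()
    for other in other_histories:
        similarity = jaccard(user_history, other)
        if similarity == 0:
            continue
        for article in other - user_history:
            scores[article] += similarity
    return [article for article, _ in scores.most_common(top_n)]

picks = recommend({"a", "b"}, [{"a", "b", "c"}, {"b", "d"}, {"x", "y"}])
```

Here the reader who shares two of three articles with the user contributes more weight than the reader who shares only one, so their unread article ranks first.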

Read more...

Extracting Structured Data From Recipes Using Conditional Random Fields

In 1994, a member of the newsroom named Rich Meislin wrote an internal memo about the value of “computer-based services” that The Times could offer its readers. One of the proposed services was RecipeFinder: a database of recipes “searchable by key ingredient” and “type of cuisine.” It took the company almost 20 years, several failed starts and a massive data cleanup effort, but the idea of cooking as a “digital service” (read: web app) is finally a reality.

NYT Cooking launched last fall with over 17,000 recipes that users can search, save, rate and (coming soon!) comment on. The product was designed and built from scratch over the course of a year, but it relies heavily on nearly six years of effort to clean, catalogue and structure our massive recipe archive.

We now have a treasure trove of structured data to play with. As of yesterday, the database contained 17,507 recipes, 67,578 steps, 142,533 tags and 171,244 ingredients broken down by name, quantity and unit.

In practical terms, this means that if you make Melissa Clark’s pasta with fried lemons and chile flakes recipe, we know how many cups of Parmigiano-Reggiano you need, how long it will take you to cook and how many people you can serve. That finely structured data, while invisible to the end user, has allowed us to quickly iterate on designs, add granular HTML markup to improve our SEO, build a customized search engine and spin up a simple recipe recommendation system. It’s not an exaggeration to say that the development of NYT Cooking would not have been possible without it.
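
As a taste of what that structured data makes possible, here is the 1994 RecipeFinder idea ("searchable by key ingredient") in a few lines of Python. The recipes and quantities below are invented for illustration, not drawn from the actual database:

```python
recipes = {
    "pasta with fried lemons and chile flakes": [
        {"name": "Parmigiano-Reggiano", "quantity": 0.5, "unit": "cup"},
        {"name": "chile flakes", "quantity": 1.0, "unit": "teaspoon"},
    ],
    "garlic soup": [
        {"name": "garlic", "quantity": 8.0, "unit": "clove"},
    ],
}

def find_by_ingredient(recipes, ingredient):
    """Return the titles of recipes whose ingredient list mentions the given name."""
    needle = ingredient.lower()
    return [
        title
        for title, ingredients in recipes.items()
        if any(needle in item["name"].lower() for item in ingredients)
    ]

matches = find_by_ingredient(recipes, "parmigiano")
```

With names, quantities and units broken out as fields, this kind of query (and the SEO markup, search engine and recommendations mentioned above) becomes a straightforward lookup rather than a text-parsing problem.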

Read more...

Purifying the Sea of PDF Data, Automatically

Many government agencies publish data about their work on a regular basis, often daily or weekly. Some conveniently post it in easy-to-use formats such as CSV files. Others, however, seem to disclose it begrudgingly, each week posting a stack of PDF files on a website and removing the previous set, with no archive anywhere. At least they’re publishing at all, right?

I’ve been working on a pattern for purifying this sea of data, so that even as agencies pour in more dirty PDFs, I automatically get pure, clean CSVs. I don’t yet have a fully generic solution, but I want to share the pattern I’ve created.

Data published as a PDF is a hindrance we may never be rid of. While we all know PDFs are designed for presentation, not data analysis, that doesn’t stop agencies from handing them out in response to data requests. Tabula, an open source program for liberating data, has solved part of this problem since it was released in early 2013, but it can be time-consuming to process data contained in multiple PDFs.

The process is particularly annoying for data sets published on a regular basis, like the weekly NYPD precinct-level crime complaint tables. Not only do you have to check daily for new data, but once it’s available, you then have to fire up Tabula to process it — for each of the 85 precincts. It’s duplicative grunt work.

My pattern solves this problem using tabula-extractor, the Ruby library (and command-line tool) that powers Tabula. It’s built to output data to CSVs or to a MySQL database.

I haven’t quite figured out how to fully abstract the solution, and some work specific to each part of the pattern is still necessary. Nevertheless, the pattern is a step toward figuring out what is shared between different instances of the problem — that is, what can eventually be generalized into a library — and boiling down the differences into the inputs for a library. I’ve open sourced three instances of the pattern (that is, three scrapers) for the following data sets:

– Sierra Leone’s Ebola situation reports: GitHub
– The NYPD’s CompStat criminal complaints database weekly reports: GitHub
– The NYPD’s monthly reports of moving summonses: GitHub

Take a look. Each project contains a few executable scripts in the bin/ folder for parsing files from the web, from disk or from Amazon S3. All of them use a common parsing script in the lib/ folder. This parser is where the magic, such as it is, happens. Based on configuration options, the parser processes the PDF, extracts the relevant data (using a page number and table dimensions, if they’re common to all the PDFs), sends it to a CSV file or a MySQL database, and saves the PDF itself to Amazon S3 or to disk.
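
The shared core of those three scrapers looks roughly like the following. This is a Python paraphrase of the pattern (the real projects are Ruby, built on tabula-extractor), with a stub extractor and invented rows standing in for the actual PDF table extraction:

```python
import csv
import io

def run_pattern(sources, extract_table, sink):
    """The scraper pattern: for each source PDF, run a fixed table
    extraction (page number and table dimensions are configured per
    data set) and send every row to a sink, such as a CSV writer or a
    database insert."""
    for source in sources:
        for row in extract_table(source):
            sink(row)

# Stub standing in for a real tabula-extractor call:
def stub_extract(path):
    return [[path, "precinct-1", "42"], [path, "precinct-2", "17"]]

buffer = io.StringIO()
writer = csv.writer(buffer)
run_pattern(["week1.pdf"], stub_extract, writer.writerow)
```

Swapping the sink for a database insert, or the source list for an S3 listing, leaves the pattern unchanged; that shared shape is the part that could eventually be generalized into a library.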

Precisely how to process the PDF and how to store the data is a pattern I’m still working on. If these scrapers are useful to you, I’d love to hear your thoughts.

A New View for NYTimes Photos

The New York Times loves photos. We publish around 700 images every single day. These run the gamut from delectable dessert close-ups to evidence of the ravaging effects of climate change. We have an entire blog dedicated to photojournalism, and we currently work with over 3,600 photographers who have shot over 30,000 assignments in the past two years. Photos are an important part of our journalism and the stories we tell.

With our keen focus on photos and photojournalism, it may come as a surprise that the photo viewer in our iOS apps hasn’t changed significantly in years. In that time, some patterns have emerged that improve photo viewing on iOS, including flicking to dismiss and zooming image transitions. We set out to bring these patterns to our core iOS app in a reusable and extensible way by rewriting our photo viewer from the ground up. The New York Times has gotten serious about writing robust, minimally dependent, well-tested and fully documented components for all our apps, and this project reflects that philosophy.

Although we have a popular open source Objective-C style guide, The Times has never released an open source Objective-C project, and we wanted to take this opportunity to not only modernize our photo viewing experience, but also to write a new feature that could be shared and used in any app, both inside and outside the company. It took us a long time to open source something iOS-related, but it’s a start. And I hope that it’s the beginning of a greater focus from The Times on sharing and learning from the larger iOS community.

The photo viewer is available on GitHub, and we look forward to all of your issues, pull requests and feedback!

How to Unit Test a RequireJS Application

Adding unit tests to an existing application can be a challenge. If the application wasn’t developed with testing in mind, getting your source code into shape so that your components are isolated and testable is like adding a new character to a story after the book is already written. Significant refactoring must be done to decouple and test your code.

A particular challenge in writing unit testable code is mocking the dependencies in your test subjects. You need some means of exposing the “handles” of dependencies so that a test can pull them out and replace them with controllable mocks. If you don’t replace these dependencies, you can’t write a true unit test, since breakage in a dependency would result in a failing test. This breaks the contract of a unit test. If the subject being tested works correctly, the test should pass, even if a service or utility the subject depends on is faulty. Failures in the service or utility the test subject depends on should be caught by a separate unit test written for that service or utility.

At The New York Times, the CMS team has been writing unit tests for a new article editing application we developed for Scoop. Because we built it with Backbone.js and because it is AMD-compliant (each Backbone component or shared service, vendor library or utility class is a RequireJS module), the app was pretty modular even before we considered unit testing.

However, even though RequireJS provides numerous benefits in terms of dependency resolution and modularity, it doesn’t provide a means of doing dependency injection. When RequireJS sees that one module depends on another, it resolves the dependency module against a path in a configuration file before the dependent module is defined. This binds each dependency to an implementation and precludes dependency injection.

Consider this example of a Backbone (Brace) Lock model, a RequireJS module that depends on the UserService module:

// Lock.js
define(function(require) {
 
  var UserService = require("services/userService.js");
 
  return Brace.Model.extend({
 
    namedAttributes: {
      user: Object,
      name: String,
      lockedByCurrentUser: Boolean,
      // more attributes ...
    },
 
    initialize: function(options) {
      this.lockedByCurrentUser = UserService.getCurrentUser().get("userId") === this.get("user").userId;
    },
 
    // ...
 
  });
});

UserService is declared outside of the body of the Lock and is defined (by RequireJS) before Lock’s definition is evaluated. Lock is then a closure around UserService, and all references to UserService are internal to Lock. This encapsulation is good because clients of the Lock shouldn’t use the UserService through the Lock, but it creates a problem when the client of the Lock is a unit test. The UserService has no handle by which a unit test could mock it out.

Read more...

TimesOpen Hack Day 2014

At the fifth annual TimesOpen Hack Day hosted by The New York Times Developers, people from around the city came to hack with us and our API and platform partners CartoDB, Enigma and Google. For most attendees, it was their first-ever hack event.

Here are some highlights from the day:

Contrarian by Justin Sung and Chuck Pierce
(Best in Show)
This Chrome extension searches for New York Times articles similar to the one being read and opens a new tab of stories with opposing views on the same topic.

The Know York Times Quiz by Ofer Bronstein, Benjamin Conant, Hugo Marcotte, Faisal Nawaz and Griffin Telljohann
(Best Use of a New York Times API)
Inspired by a New York Times weekly news quiz, the team created a web app to generate quizzes automatically, based on stories in The Times.

Georgie St. Claire by Lindsay Levine and Sam McCord and Sphynx by Nathan Epstein, Jeffrey Klaus, Gabriel Lebec, Christian Sakai and Oddur Sigurdsson
(Tied for People’s Choice)
Georgie St. Claire is a Twitter bot newsreader and entertainment experience incorporating the New York Times Article Search API. Sphynx linked a visualization platform to an Oculus Rift to show code and graphs in 3-D space.

Reddnyt by Alastair Coote
Reddnyt applies Reddit’s ranking algorithm to New York Times articles shared on Facebook and Twitter. It was created by a developer at The New York Times. It’s best viewed on a mobile device.

Perooz by Sneha Inguva and MD Islam
The team working on Perooz showed off a Chrome extension for annotating and commenting on reporting, source development and crowdsourced information.

Newstips by Nicole Dominguez, Daniel Gonzalez, Stefan Huynh, Matt Nelson and James Walker
Making the most of crowdsourced information can be a challenge when there’s a lot of material coming in. Newstips gives users an interface for leaving a tip that includes geolocation, making it easy for reporters or other end users to find and sort tips by place.

Thanks to all of our 2014 Hack Day participants. For more from TimesOpen, check out slides and other material from earlier events on transitioning to continuous delivery and reactive programming and join us in 2015. Sign up for announcements at http://developers.nytimes.com/events/newsletter/.
