In 1994, a member of the newsroom named Rich Meislin wrote an internal memo about the value of “computer-based services” that The Times could offer its readers. One of the proposed services was RecipeFinder: a database of recipes “searchable by key ingredient” and “type of cuisine.” It took the company almost 20 years, several failed starts and a massive data cleanup effort, but the idea of cooking as a “digital service” (read: web app) is finally a reality.

NYT Cooking launched last fall with over 17,000 recipes that users can search, save, rate and (coming soon!) comment on. The product was designed and built from scratch over the course of a year, but it relies heavily on nearly six years of effort to clean, catalogue and structure our massive recipe archive.

We now have a treasure trove of structured data to play with. As of yesterday, the database contained 17,507 recipes, 67,578 steps, 142,533 tags and 171,244 ingredients broken down by name, quantity and unit.

In practical terms, this means that if you make Melissa Clark’s pasta with fried lemons and chile flakes recipe, we know how many cups of Parmigiano-Reggiano you need, how long it will take you to cook and how many people you can serve. That finely structured data, while invisible to the end user, has allowed us to quickly iterate on designs, add granular HTML markup to improve our SEO, build a customized search engine and spin up a simple recipe recommendation system. It’s not an exaggeration to say that the development of NYT Cooking would not have been possible without it.

Until recently, the collection and maintenance of this structured data was a completely manual process. For years, overnight contractors have entered recipes, dropdown by dropdown, into a gray and white web form that lives in our content management system (CMS). Since the database breaks down each ingredient by name, unit, quantity and comment, an average recipe requires over 50 fields, and that number can climb above 100 for more complicated recipes.

I long suspected that the manual process of entering recipes into the database could be replaced with an algorithmic solution. The field of Natural Language Processing (NLP) has developed powerful algorithms to solve similar tasks over the past decade. If a computer can identify the part of speech of each word in a sentence, it should be able to identify an ingredient quantity from an ingredient phrase.

For an internal hack week last summer, a colleague and I decided to test our faith in statistical NLP to automatically convert unstructured recipe text into structured data. A few months of on-and-off work later, our recipe parser is now fully integrated into our CMS.

The most challenging aspect of the recipe parsing problem is the task of predicting ingredient components from the ingredient phrases. Recipes display ingredients like “1 tablespoon fresh lemon juice,” but the database stores ingredients broken down by name (“lemon juice”), quantity (“1”), unit (“tablespoon”) and comment (“fresh”). There is no regular expression clever enough to identify these labels from the ingredient phrases.

#### Example

| Ingredient Phrase | 1 | tablespoon | fresh | lemon | juice |
| --- | --- | --- | --- | --- | --- |
| Ingredient Labels | QUANTITY | UNIT | COMMENT | NAME | NAME |

This type of problem is referred to as a structured prediction problem because we are trying to predict a structure — in this case a sequence of labels — rather than a single label. Structured prediction tasks are difficult because the choice of a particular label for any one word in the phrase can change the model’s prediction of labels for the other words. The added model complexity allows us to learn rich patterns about how words in a sentence interact with the words and labels around them.

We chose to use a discriminative structured prediction model called a linear-chain conditional random field (CRF), which has been successful on similar tasks such as part-of-speech tagging and named entity recognition.

The basic setup of the problem is as follows:

Let $\{x^1, x^2, \ldots, x^N\}$ be the set of ingredient phrases, e.g. {“½ cup whole wheat flour”, “pinch of salt”, …}, where each $x^i$ is an ordered list of words. Associated with each $x^i$ is a list of tags, $y^i$.

For example, if $x^i = [x_1^i, x_2^i, x_3^i] = [\text{“pinch”}, \text{“of”}, \text{“salt”}]$ then $y^i = [y_1^i, y_2^i, y_3^i] = [\text{UNIT}, \text{UNIT}, \text{NAME}]$. A tag is one of NAME, UNIT, QUANTITY, COMMENT or OTHER (i.e., none of the above).
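The notation above maps naturally onto parallel lists of words and tags. A minimal sketch of how such training examples might be represented (the variable names are illustrative, not the actual CMS schema):

```python
# The five tags used by the recipe parser.
TAGS = ["NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"]

# Each training example pairs a tokenized ingredient phrase x^i
# with an equal-length list of tags y^i.
examples = [
    (["pinch", "of", "salt"], ["UNIT", "UNIT", "NAME"]),
    (["1", "tablespoon", "fresh", "lemon", "juice"],
     ["QUANTITY", "UNIT", "COMMENT", "NAME", "NAME"]),
]

# Sanity check: one tag per word, every tag drawn from the tag set.
for x, y in examples:
    assert len(x) == len(y)
    assert all(tag in TAGS for tag in y)
```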

The goal is to use data to learn a model that can predict the tag sequence for any ingredient phrase we throw at it, even if the model has never seen that phrase before. We approach this task by modeling the conditional probability of a sequence of tags given the input, denoted $p(\text{tag sequence} \mid \text{ingredient phrase})$ or, using the above notation, $p(y \mid x)$.

The process of learning that probability model is described in detail below, but first imagine that someone handed us the perfect probability model $p(y \mid x)$ that returns the “true” probability of a sequence of labels given an ingredient phrase. We want to use $p(y \mid x)$ to discover (or *infer*) the most probable label sequence.

Armed with this model, we could predict the best sequence of labels for an ingredient phrase by simply searching over all tag sequences and returning the one that has the highest probability.

For example, suppose our ingredient phrase is “pinch of salt.” Then we need to score all the possible sequences of three tags.

$$
\begin{aligned}
&p(\text{UNIT UNIT UNIT} \mid \text{“pinch of salt”}) \\
&p(\text{QUANTITY UNIT UNIT} \mid \text{“pinch of salt”}) \\
&p(\text{UNIT QUANTITY UNIT} \mid \text{“pinch of salt”}) \\
&p(\text{UNIT UNIT QUANTITY} \mid \text{“pinch of salt”}) \\
&p(\text{UNIT QUANTITY QUANTITY} \mid \text{“pinch of salt”}) \\
&p(\text{QUANTITY QUANTITY QUANTITY} \mid \text{“pinch of salt”}) \\
&p(\text{UNIT QUANTITY NAME} \mid \text{“pinch of salt”}) \\
&\qquad\vdots
\end{aligned}
$$
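This exhaustive search can be written directly in code. In the sketch below, `score_sequence` is a hypothetical stand-in for the learned model $p(y \mid x)$; the point is only to show that brute force enumerates every one of the $5^3 = 125$ tag sequences for a three-word phrase:

```python
import itertools

TAGS = ["NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"]

def score_sequence(tags, words):
    """Toy stand-in for p(y | x): rewards a few hand-picked
    word/tag pairings. The real model is learned from data."""
    likely = {("pinch", "UNIT"), ("of", "UNIT"), ("salt", "NAME")}
    return sum(1.0 for w, t in zip(words, tags) if (w, t) in likely)

def brute_force_decode(words):
    # Enumerate every possible tag sequence -- |TAGS| ** len(words)
    # of them -- and keep the highest-scoring one.
    # Fine for 3 words, hopeless at scale.
    best = max(itertools.product(TAGS, repeat=len(words)),
               key=lambda tags: score_sequence(tags, words))
    return list(best)

print(brute_force_decode(["pinch", "of", "salt"]))  # -> ['UNIT', 'UNIT', 'NAME']
```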

While this seems like a simple problem, it can quickly become computationally unpleasant to score all of the $|\text{tags}|^{|\text{words}|}$ sequences.** The beauty of the linear-chain CRF model is that it makes some conditional independence assumptions that allow us to use dynamic programming to efficiently search the space of all possible label sequences. In the end, we are able to find the best tag sequence in time quadratic in the number of tags and linear in the number of words ($|\text{tags}|^2 \cdot |\text{words}|$).
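That dynamic program is Viterbi decoding. A minimal sketch, with a toy log-potential table standing in for the learned model (the `toy_score` values are invented for illustration):

```python
def viterbi(words, tags, score):
    """Find the best tag sequence in O(|tags|^2 * |words|).
    `score(prev_tag, tag, words, t)` returns the log-potential
    log psi(y_t, y_{t-1}, x); prev_tag is None at t = 0."""
    # best[t][tag] = score of the best sequence ending in `tag` at position t
    best = [{tag: score(None, tag, words, 0) for tag in tags}]
    back = []  # backpointers for recovering the argmax sequence
    for t in range(1, len(words)):
        col, ptr = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + score(p, tag, words, t))
            col[tag] = best[-1][prev] + score(prev, tag, words, t)
            ptr[tag] = prev
        best.append(col)
        back.append(ptr)
    # Trace back from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return seq[::-1]

def toy_score(prev, tag, words, t):
    # Hypothetical log-potentials for "pinch of salt".
    table = {("pinch", "UNIT"): 2.0, ("of", "UNIT"): 1.5, ("salt", "NAME"): 2.0}
    return table.get((words[t], tag), -1.0)

tags = ["NAME", "UNIT", "QUANTITY", "COMMENT", "OTHER"]
print(viterbi(["pinch", "of", "salt"], tags, toy_score))  # -> ['UNIT', 'UNIT', 'NAME']
```

Instead of scoring 125 full sequences, the table `best` is filled in one word at a time, considering only pairs of adjacent tags.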

So given a model $p(y \mid x)$ that encodes whether a particular tag sequence is a good fit for an ingredient phrase, we can return the best tag sequence. But how do we learn that model?

A linear-chain CRF models this probability in the following way:

$$
\begin{equation}
p(y \mid x) \propto \prod_{t=1}^{T} \psi(y_t, y_{t-1}, x)
\end{equation}
$$

where $T$ is the number of words in the ingredient phrase $x$.

Let’s break this equation down in English.

The above equation introduces a “potential” function $\psi$ that takes two consecutive labels, $y_t$ and $y_{t-1}$, and the ingredient phrase, $x$. We construct $\psi$ so that it returns a large, non-negative number if the labels $y_t$ and $y_{t-1}$ are a good match for the $t^{\text{th}}$ and $(t-1)^{\text{th}}$ words in the sentence respectively, and a small, non-negative number if not. (Remember that probabilities must be non-negative.)

The potential function is a weighted average of simple feature functions, each of which captures a single attribute of the labels and words.

$$
\begin{equation}
\psi(y_t, y_{t-1}, x) = \exp\left\{\sum_{k=1}^{K} w_k f_k(y_t, y_{t-1}, x)\right\}
\end{equation}
$$

We often define feature functions to return either 0 or 1. Each feature function, $f_k(y_t, y_{t-1}, x)$, is chosen by the person who creates the model, based on what information might be useful to determine the relationship between words and labels. Some feature functions we used for this problem were:

$$
\begin{align*}
&f_1(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } x_t \text{ is capitalized and } y_t \text{ is NAME} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_2(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } x_t \text{ is “1/2” and } y_t \text{ is QUANTITY} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_3(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } x_t \text{ is “cup” and } y_t \text{ is QUANTITY} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_4(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } x_t \text{ is “flour” and } y_t \text{ is QUANTITY} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_5(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } x_t \text{ is a fraction and } y_t \text{ is QUANTITY} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_6(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } y_t \text{ is QUANTITY and } y_{t-1} \text{ is UNIT} \\ 0 & \text{otherwise} \end{cases} \\[1ex]
&f_7(y_t, y_{t-1}, x) = \begin{cases} 1 & \text{if } y_t \text{ is QUANTITY and } y_{t-1} \text{ is NAME} \\ 0 & \text{otherwise} \end{cases}
\end{align*}
$$

There is a feature function for every word/label pair and for every consecutive label pair, plus some hand-crafted functions. By modeling the conditional probability of labels given words in this way, we have reduced our task of learning $p(y \mid x)$ to the problem of learning “good” weights on each of the feature functions. By good, I mean that we want to learn large positive weights on features that capture highly likely patterns in the data, large negative weights on features that capture highly unlikely patterns in the data and small weights on features that don’t capture any patterns in the data.

For example, $f_2$ describes a likely pattern in the data (“½” is likely a quantity), $f_4$ describes an unlikely pattern in the data (the word “flour” is almost never a quantity) and $f_1$ doesn’t capture a common pattern (the ingredient phrases are almost always lowercased). In this case, we want $w_2$ to be a large positive number, $w_4$ to be a large negative number and $w_1$ to be close to 0.
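Concretely, the feature functions are just binary predicates, and the potential is the exponentiated weighted sum of whichever predicates fire. A sketch with three of the features above and illustrative weights (in practice the weights are learned, not set by hand):

```python
import math

# Feature functions as binary predicates over (prev_tag, tag, words, t).
def f1(prev, tag, words, t):
    # Capitalized word tagged NAME (rare: phrases are mostly lowercase).
    return int(words[t][0].isupper() and tag == "NAME")

def f2(prev, tag, words, t):
    # The token "1/2" tagged QUANTITY (a very likely pattern).
    return int(words[t] == "1/2" and tag == "QUANTITY")

def f4(prev, tag, words, t):
    # The word "flour" tagged QUANTITY (a very unlikely pattern).
    return int(words[t] == "flour" and tag == "QUANTITY")

features = [f1, f2, f4]
weights = [0.1, 3.0, -3.0]  # illustrative: w_1 small, w_2 large, w_4 negative

def psi(prev, tag, words, t):
    """Potential psi(y_t, y_{t-1}, x): exp of the weighted feature sum."""
    return math.exp(sum(w * f(prev, tag, words, t)
                        for w, f in zip(weights, features)))
```

With these weights, tagging “1/2” as QUANTITY yields a potential well above 1, while tagging “flour” as QUANTITY yields one well below 1, which is exactly the behavior the learned weights should produce.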

Due to properties of the model — chiefly, that the function is convex with respect to the weights — there is one best set of weights and we can find it using an iterative optimization algorithm. We used the CRF++ implementation to do the optimization and inference.
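CRF++ reads training data as whitespace-separated columns, one token per line with the label in the last column and a blank line between sequences (additional feature columns can sit between the token and the label). A small, hypothetical helper to serialize our examples into that shape:

```python
def to_crfpp(examples):
    """Serialize (tokens, tags) pairs into CRF++ training format:
    one token per line, tab-separated columns with the tag last,
    and a blank line terminating each phrase."""
    lines = []
    for tokens, tags in examples:
        for tok, tag in zip(tokens, tags):
            lines.append(f"{tok}\t{tag}")
        lines.append("")  # blank line ends the sequence
    return "\n".join(lines)

print(to_crfpp([(["pinch", "of", "salt"], ["UNIT", "UNIT", "NAME"])]))
```

The actual features CRF++ extracts from these columns are controlled by a separate template file, which is where patterns like the $f_k$ functions above get declared.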

#### Results

Our model achieved 89% sentence-level accuracy when trained on 130,000 labeled ingredient phrases from the database. The data was too noisy for automatic evaluation, so we evaluated sentence-level accuracy by hand on a test set of 481 examples.
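Sentence-level accuracy is a strict metric: a phrase counts as correct only if every tag in its sequence matches. A minimal sketch of that computation:

```python
def sentence_accuracy(gold, predicted):
    """Fraction of phrases whose full tag sequence matches exactly.
    A single wrong tag anywhere makes the whole sentence wrong."""
    exact = sum(1 for g, p in zip(gold, predicted) if g == p)
    return exact / len(gold)

gold = [["QUANTITY", "UNIT", "NAME"], ["UNIT", "UNIT", "NAME"]]
pred = [["QUANTITY", "UNIT", "NAME"], ["UNIT", "QUANTITY", "NAME"]]
print(sentence_accuracy(gold, pred))  # -> 0.5
```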

Below are some examples of where we do well, where we do poorly, and where there is no clear correct answer. Recall that we are trying to predict NAME, UNIT, QUANTITY, COMMENT and OTHER.

**Truth:** 1 [QUANTITY] | garlic clove [NAME] | , minced ( optional ) [OTHER]

**Guess:** 1 [QUANTITY] | garlic [NAME] | clove [UNIT] | , minced ( optional ) [OTHER]

This example is confusing for both our human annotators and the algorithm. We probably want “clove” to be part of the ingredient name instead of the unit, but we see both variations in our training data.

**Truth:** 2 [QUANTITY] | red onions , peeled and diced [NAME]

**Guess:** 2 [QUANTITY] | red onions [NAME] | , [OTHER] | peeled and diced [COMMENT]

Here is an example of the CRF correcting a human annotator’s error. “Peeled and diced” should be part of the comment.

**Truth:** 4 [QUANTITY] | tablespoons [UNIT] | melted nonhydrogenated [OTHER] | margarine [NAME] | , melted coconut oil or canola oil [OTHER]

**Guess:** 4 [QUANTITY] | tablespoons [UNIT] | melted nonhydrogenated [OTHER] | margarine [NAME] | , melted [COMMENT] | coconut oil [NAME] | or canola [OTHER] | oil [NAME]

This ingredient phrase contains multiple ingredient names, which is a situation that is not accounted for in our database schema. We need to rethink the way we label ingredient parts to account for examples like this.

**Truth:** 1 [QUANTITY] | bunch [UNIT] | scallions [NAME] | , [OTHER] | trimmed and cut into 1/4-inch lengths [COMMENT]

**Guess:** 1 [QUANTITY] | bunch [UNIT] | scallions [NAME] | , [OTHER] | trimmed and cut into 1/4-inch lengths [COMMENT]

And sometimes everyone gets it right!

#### Takeaway

Extracting structured data from text is a common problem at The Times, and for 164 years the vast majority of this data wrangling (e.g. cataloging, tagging, associating) has been done manually. But there is an ever-increasing appetite from developers and designers for finely structured data to power our digital products and at some point, we will need to develop algorithmic solutions to help with these tasks. The recipe parser, which combines machine learning with our huge archive of labeled data, takes a first step towards solving this important problem. Email me at erica.greene@nytimes.com if you’d like more details about this project.

** We actually use BIO tagging, so there are $(|\text{tags}| \times 2)^{|\text{words}|}$ sequences.