Summarization

FF04 · 2016

This is an applied research report by Cloudera Fast Forward. We write reports about emerging technologies, and conduct experiments to explore what’s possible. Read our full report about Summarization below, or download the PDF. The prototype for our report on Summarization is called Brief. The prototype lets you visualize the process of summarization across different types of documents. We hope you enjoy exploring it.

1. Introduction

Making language computable has been a goal of computer science research for decades. Historically, it has been a challenge to merely collect and store data. Today, however, it has become so cheap to store data that we often have the opposite problem. Once data has been collected, it’s now a challenge to meaningfully analyze it.

Large amounts of unstructured text can be resistant to analysis

We’ve seen significant progress in infrastructure for using data effectively in the last half-decade. One under-recognized constraint is that this hasn’t applied to all types of data equally. Unstructured text, in particular, has been slower to yield to the kinds of analysis that many businesses are starting to take for granted. Rather than being limited by what we can collect, we are now constrained by the tools, time, and techniques to make good use of it.

We are beginning to gain the ability to do remarkable things with unstructured text data. This progress is the result of several trends. First, the use of neural networks and deep learning for text offers the ability to build models that go beyond just counting words to actually representing the concepts and meaning in text quantitatively. Second, the cost of the computation necessary has declined dramatically, making it easier to apply these neural models as well as larger versions of classical models. Finally, more and more useful data is available to use to train these systems.

This report focuses on text summarization, or taking some text in and returning a shorter document that contains the same information. We address both single document and multi-document summarization, with current best approaches and prototypes presented for each. However, summarization should also be thought of as a gateway application for these quantified representations of text.

Advances in the quantified representation of text make it possible to generate summaries from multiple documents according to user needs

Summarization is an active area of research, and there are many immediately useful applications. For example, imagine summarizing thousands of product reviews, a long news article, or even taking millions of customer bios to find the various clusters and personas automatically.

Summarization is also the kind of problem where the simple case, summarizing a single article, streamlines a task that many people already do. However, the complex case, summarizing potentially tens or hundreds of thousands of articles at once, represents an entirely new capability.

This report walks through the landscape of summarization algorithms by first introducing a simple explanatory algorithm, then a proof-of-concept multi-document summarization system, and finally a fully realized extractive summarization prototype. These examples start simple and eventually demonstrate the breakthrough capabilities realized by applying sentence embeddings and recurrent neural networks to capture the semantic meaning of text. These capabilities are poised to transform the way we process language in the next few years.

2. Extractive Summarization

The simplest way to summarize a document is to extract passages directly from the document, and use these to build a summary. This is called extractive summarization, because the summary is built by selecting passages, rather than writing new text.

But how many passages should you select? Which passages should you select? How do you stitch them together or otherwise present them to make a useful summary? And how do you know whether your summary does a good job of retaining the meaning of the original document? Extractive summarization might seem like a simple idea, but these questions turn out to be subtle.

This chapter will give you an overview of how extractive summarization works. It begins with a discussion of how to summarize a single document, before extending the approach to multiple documents.

Single-Document Summarization

To build an extractive summary of a single document, you extract a subset of the document. The extract might be a few sentences, which need not (and often should not) appear next to each other in the original document.

To qualify as a summary, the extracted subset must be shorter than the original document, but should still capture as much as possible of its original meaning — particularly any ideas that the summary user is likely to consider important.

Extractive summaries are made up of sentences selected from the original document

A General Framework

Extractive single-document summarization has a long history. The specific approaches we’ll describe in detail, which range from naive heuristics to the latest neural network techniques, might seem very different, but at a high level, they all consist of three steps:

  1. Vectorize — Turn each sentence into a sequence of numbers (a vector) that can be analyzed and scored by a computer.

  2. Score — Rate each sentence by assigning a score or rank to its vectorized representation.

  3. Select — Select the best sentences.

Vectorize, score, and select
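To make the framework concrete, here is a minimal sketch of the three steps as interchangeable functions. The function names are ours; the concrete systems described in the rest of this report (Luhn's algorithm, topic modeling, RNN embeddings) are all instances of this shape, though real scorers often also need document-level context.

```python
# A minimal sketch of the vectorize-score-select framework.
# Each step is a pluggable function supplied by the caller.
def summarize(sentences, vectorize, score, select):
    vectors = [vectorize(s) for s in sentences]  # 1. Vectorize each sentence
    scores = [score(v) for v in vectors]         # 2. Score each vector
    return select(sentences, scores)             # 3. Select the best sentences
```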

Vectorize

Figure 2.3: Vectorize detail

Throughout the vectorize-score-select framework just described, the document is treated as a list of short passages, each of which is a candidate to be extracted and added to the summary. These short passages are often simply sentences. In a more complex system, or a system designed to be applied to longer documents, the passages might be sub-sentence features such as clauses, or larger features such as paragraphs. For simplicity, we’ll refer to the passages as sentences.

Vectorization means taking a string of text and turning it into a vector of numbers that a computer can analyze. The aim is to represent a piece of text in a quantitative way, but to do this in a way that retains as much meaning as possible. You can think about the output of this step as a kind of table of contents, or index, for the original text. The index tells you what words or topics the text contains, but it generally cannot be used to reconstruct the text.

The most common way to vectorize text is the bag of words model. First, you define a vocabulary. This might be all the words in a dictionary, all the words in your particular set of documents, or all the possible tokens in some constrained domain (e.g., the four bases in DNA, or all the animal emojis). You can then vectorize a sentence by building a vector where each element corresponds to a word in the vocabulary, and its value is the number of times that word occurred in that particular sentence.

Sentence                  cat  sat  mat  dog  bites  man
the cat sat on the mat     1    1    1    0     0     0
cats and dogs              1    0    0    1     0     0
dog bites man              0    0    0    1     1     1
man bites dog              0    0    0    1     1     1

Table 2.1: An example of the bag-of-words model (plural forms are counted with their singular vocabulary entries).
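As a concrete illustration, here is a minimal sketch of bag-of-words vectorization using scikit-learn's CountVectorizer, with the vocabulary and sentences of Table 2.1. A real system would build the vocabulary from the documents themselves and normalize word forms.

```python
# Bag-of-words vectorization of the Table 2.1 example sentences.
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["cat", "sat", "mat", "dog", "bites", "man"]
sentences = [
    "the cat sat on the mat",
    "cats and dogs",  # without stemming, plurals don't match the vocabulary
    "dog bites man",
    "man bites dog",
]

vectorizer = CountVectorizer(vocabulary=vocabulary)
for sentence, vector in zip(sentences, vectorizer.fit_transform(sentences).toarray()):
    print(vector, "<-", sentence)
# Note that "dog bites man" and "man bites dog" produce identical vectors.
```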

Vectorization is the first step in almost all natural language processing (NLP) problems, not just summarization. And the bag-of-words model is such a common approach that, to an extent, it is synonymous with vectorization. Later we’ll see two more advanced approaches: topic modeling and sentence embedding. But in the context of summarization, another approach to vectorization is often effective: heuristics.

In the heuristic approach, rather than saying whether words in the vocabulary do or do not occur in a sentence, you record other attributes or metadata of the sentence. These attributes are features that you expect to be useful in figuring out how good a summary a sentence might be. For example, you could record the position in the document (perhaps you think earlier sentences tend to be better summaries), or the number of proper names (sentences that mention these contain a lot of information). Classical NLP techniques can be used to record more complex attributes of each sentence, such as entities (people, places, events) and ideas (topics, sentiment, time, quality)[1].
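To make this concrete, here is a minimal sketch of heuristic vectorization. The particular attributes (relative position, length, a rough proper-noun count, a cue phrase) are illustrative choices, not a recommendation:

```python
# Heuristic "vectorization": describe each sentence by hand-picked attributes
# rather than by word counts.
def heuristic_features(sentence, position, n_sentences):
    words = sentence.split()
    return [
        position / n_sentences,                  # relative position in the document
        len(words),                              # sentence length
        sum(w[0].isupper() for w in words[1:]),  # rough count of proper nouns
        int("in summary" in sentence.lower()),   # cue phrase present?
    ]

document = [
    "Acme Corp announced record profits on Tuesday.",
    "The chief executive thanked employees in Boston.",
    "In summary, it was a strong quarter for Acme.",
]
vectors = [heuristic_features(s, i, len(document)) for i, s in enumerate(document)]
```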

Score

Figure 2.4: Score detail

Using the vectorized representations, you assign each sentence a score (or, equivalently, a rank) that says how useful its inclusion in a summary would be. The obvious question is: how do you define the scoring function?

If the vectors are bags of words, one approach to scoring is Luhn’s algorithm, which we describe in detail in A Simple Implementation.

When the vectors record heuristic attributes, those heuristics have presumably been chosen because of an expectation that the attributes correlate with how “good” a sentence is. For example, if you know that earlier sentences are better summaries, you can assign them higher scores. The heuristics therefore imply the scoring method. Or, as discussed in Fundamental Limitations of Heuristics, you can use machine learning to learn from example summaries where in documents the best sentences tend to be.

We explain more complex scoring methods in Summarization Using Topic Modeling and Language Models with RNNs.

Select

Figure 2.5: Select detail

The final step is to select a small number of high-scoring sentences. These are the summary. Depending on the kind of summary you want, and the details of how the scores were assigned, you might simply choose the five sentences with the highest scores. Or, to avoid repetition in the summary, you might penalize the choice of multiple sentences that cover the same concepts. And to constrain a summary by word count, you might penalize long sentences, which take a long time to say relatively little.

Constructing a summary that flows as a standalone document may require post-processing to stitch together the selected sentences into a sequence that makes sense. You may also need to replace pronouns with their referents (e.g., replace “he gave a speech” with “Barack Obama gave a speech”).[2] Our Natural Language Generation report provides approaches to this.

The vectorize-score-select framework is fairly straightforward, but it is worth keeping in mind as you think through any approach to summarization, because it helps you assess technical innovation. The earlier in the framework a change is made, the more significant it is. The vectorization step is fundamental, and improvements to it have far-reaching consequences that can yield huge gains in summary quality. There is far less scope to transform the performance of a system by altering the selection step at the end of the framework. Two new approaches to vectorization, Latent Dirichlet Allocation and sentence embedding, led us to write this report. They yield huge improvements over heuristics.

A Simple Implementation

Luhn’s Algorithm

This section presents a simple example of the vectorize-score-select framework: Hans Peter Luhn’s “The Automatic Creation of Literature Abstracts.”[3] Luhn’s is a classic summarization algorithm. While it is fragile and limited, it serves as a great explanatory tool because it is simple enough to understand without any code.

Vectorize

Luhn’s algorithm begins with the simplest vectorization imaginable: bags of words. As described previously, each sentence becomes a vector, where each position in the vector records the number of times the corresponding item in the vocabulary occurred in that sentence (see Table 2.1).

Score

Luhn’s intuition was that a word that appears many times in a document is important to the meaning of that document. Therefore, a sentence that contains many of the words that appear many times in the overall document is itself highly representative of that document. For example, if a document’s most significant words are “gene” (which appears 20 times) and “sequence” (which appears 17 times), then this implies that the sentence “The sequence was present in the gene” is probably a useful one to extract in the summary.

Figure 2.6: In the scoring step of our demo of Luhn’s algorithm, the four most frequently occurring words are highlighted on the right and within the document. You can experiment with the demo yourself at http://www.fastforwardlabs.com/luhn

Luhn’s scoring step depends on this idea. First, sum the vectors for each sentence. This gives the word count vector for the entire document, which you use to make a list of “significant words”: the most frequently occurring non-stopwords in the entire document.[4] With this list of significant words, you can now score each sentence. The score is the number of significant words that occur in the sentence.

Figure 2.7: The top-scoring sentences are highlighted in yellow

Select

The final step is to pick out the highest-scoring sentences, choosing enough of them to reach the desired summary length or to satisfy other criteria. Together, the selected sentences form the summary.
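Although Luhn's algorithm can be understood without code, a compact sketch makes the three steps concrete. The stopword list, the number of significant words, and the summary length below are illustrative choices; a real implementation would also tokenize and stem more carefully:

```python
# A Luhn-style summarizer: the most frequent non-stopwords in the document are
# "significant," and each sentence scores one point per significant word.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "was", "to", "it"}

def luhn_summary(sentences, n_significant=10, n_select=3):
    # Sum the sentence bag-of-words vectors to get document-level word counts
    counts = Counter(
        word for s in sentences for word in s.lower().split() if word not in STOPWORDS
    )
    significant = {word for word, _ in counts.most_common(n_significant)}

    # Score: number of significant words in each sentence
    scores = [sum(1 for word in s.lower().split() if word in significant) for s in sentences]

    # Select: top-scoring sentences, restored to document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_select])]
```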

Enhancements with Machine Learning

Luhn’s word count approach can be enhanced with heuristics. For example, in certain kinds of documents you might know from experience that the better summary sentences appear early in the document, resemble the title, or contain uppercase letters, proper nouns, numbers, or cue phrases like “in summary,” “the most important,” and “in particular.” You could incorporate this knowledge into Luhn’s algorithm by adding these attributes as heuristics.[5] You might do this by including sentence position, a quantitative measure of similarity to the title, or the presence of cue phrases in the vector, and then adjusting the sentence scores by hand based on these attributes.

Even better, supervised machine learning can be used to treat these heuristic attributes as features. A supervised machine learning approach takes a list of features, or factors that may indicate that a sentence is a good summary, and trains a model over those features to learn which ones are relevant and which ones aren’t. For example, rather than simply imposing a heuristic that early sentences should score higher, you can add sentence position as a feature in a supervised machine learning model, and use training documents to learn where in a document the best summary sentences are likely to be found.

This reduces the need for the engineers of the system to understand and have experience with the documents being summarized. They no longer need to know from experience that, for example, sentences containing numbers are better summary candidates. The algorithm can infer this from the training data.
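A minimal sketch of this idea, assuming each training sentence has a heuristic feature vector and a label saying whether it appeared in a human-written summary. The synthetic data below stands in for such a training set:

```python
# Learn which heuristic features matter from labeled example summaries.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.random((200, 4))                # e.g., heuristic_features() per sentence
y_train = (X_train[:, 0] < 0.3).astype(int)   # pretend early sentences were the ones chosen

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_train)[:, 1]   # higher score = better summary candidate
```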

Fundamental Limitations of Heuristics

Machine learning does not remove the need for domain expertise altogether. Engineers still need to know which attributes might help. Supervised machine learning will tell you whether sentences with numbers make better summaries, but it won’t give you the idea to try that feature in the first place. The difficult, expensive process of inventing, designing, and implementing features for the model to use is called feature engineering.

When engineers have the domain expertise required to identify features that work for a specific kind of article and training data is available, then heuristic and machine learning approaches can be used in conjunction. However, in general we have found that the heuristic approach is limited and fragile. People writing heuristics tend to miss edge cases!

Most importantly, these systems tend to be domain-specific since the heuristics were imposed with a particular kind of document in mind, or learned from a particular training set. This may, for example, restrict the system to summarizing scientific articles or emails, or to documents in English. It may be possible to retrain a system for a new domain, but the original decision to take the heuristic approach imposes a technical debt that must be paid each time the system is applied in a new context.

This cost may be negligible if the system only needs to work in one narrow domain, and the heuristic approach works well in that domain. However, the heuristic (or feature engineering) approach has a more fundamental limit: even with hand-engineered features, the vectorized representation of the document often does a poor job of capturing the document’s meaning. For example, as shown in Table 2.1, “dog bites man” and “man bites dog” have identical vectors in a bag of words, but they mean very different things!

This means that the score and select stages of the vectorize-score-select framework are simply missing the information they need to do the best possible job. The next four chapters of this report are dedicated to describing the advances made in the last few years that have transformed our ability to turn text into numbers a computer can work with, while retaining as much of the document’s meaning as possible. Summarization Using Topic Modeling is about topic modeling, and Latent Dirichlet Allocation in particular. Recurrent Neural Networks - Background, Prototype - Brief, and Recurrent Neural Networks - Challenges are about sentence embeddings and recurrent neural networks.

Multi-Document Summarization

While summarizing a single document is useful and interesting, many more use cases open up when we consider the ability to summarize multiple documents. It might be convenient to read only a few sentences and get the point of an article, but it would be superhuman to read a few pages and get the point of tens of thousands of documents.

Figure 2.8: An extractive summary can be drawn from multiple documents

You can summarize multiple documents by simply summarizing the concatenation of the documents as if it were a single document.[6] For documents of similar length, in situations where the summary should in some sense represent the “average” document, this naive approach may be sufficient. However, if the summarization algorithm depends on the ordering of sentences (e.g. it uses recurrent neural networks or simply a heuristic to favor early sentences) then the order in which the documents are concatenated will affect the resulting summary, which is usually a bad thing. Moreover, when the documents are diverse and you want to capture rare but significant ideas, more care may be required.

For example, imagine a group of product reviews where the reviews are mostly positive. In the naive approach of concatenating the documents, the summarization algorithm might spot this majority opinion, and therefore select extracts that are positive. But perhaps the handful of negative reviews identify a rare but serious flaw in the product (Ethics). Any one of the three steps in the vectorize-score-select framework could be adjusted to ensure the summary captures these unusual ideas. The details will necessarily be somewhat dependent on the type of documents you are summarizing and how the summary will be used. The next chapter, Summarization Using Topic Modeling, looks at this decision process while also introducing topic modeling.

3. Summarization Using Topic Modeling

In this chapter we describe a simple proof-of-concept summarization method that uses topic modeling for vectorization. We apply it to product reviews on Amazon. This proof of concept follows the vectorize-score-select framework described in A General Framework. In that introductory chapter, we presented the bag-of-words and heuristic approaches to vectorization, and discussed why they are often fragile or overly simplistic.

Our topic modeling-based summarization strategy avoids these problems by first learning “topics” from a corpus of sample documents, and then vectorizing each sentence in terms of these topics. Effectively, we use topic modeling to learn, in a data-driven way, how to vectorize the text while retaining crucial semantic information. But before we look at summarization and our proof of concept, we need to develop an understanding of topic modeling itself.

Topic Modeling

Topic modeling is an unsupervised machine learning technique that compresses documents while retaining semantic content and interpretability. Because it is unsupervised, the engineer does not have to read or understand the documents. The documents don’t need to be labeled in any way. Given the stack of documents and a choice of the number of topics, the algorithm identifies groups of words that co-occur. These groups are called topics.

Topics are simply groups of words that co-occur, so they may not be exactly what you expect. Empirically, these groups of words are often useful, but there are no guarantees that particular topics will be found. For example, when training on a corpus of news articles, we may not get topics that correspond precisely to the sections of a newspaper: “business,” “arts,” “politics,” “world,” etc. See The Number of Topics and The Meaning of Topics for more on this.

Figure 3.1: Topic modeling learns topics by looking for groups of words that frequently co-occur

Once the topics have been learned, any new document can be expressed as a vector where each element can be thought of as the weight of that topic in that document. Rather than writing out the document as raw words (or as a bag-of-words vector), you can represent it as a short vector of topic weights, e.g., 50% words from the politics topic, 30% words from the oil topic, and 20% from the Obama topic. In that sense, topic modeling is a form of lossy compression. The compressed version can be used to infer some of the original document’s essential features (e.g., “it is about politics, oil, and Barack Obama”), but it can’t be used to reconstruct the entire document.

Figure 3.2: After training on a corpus, a topic model can be applied to a document to determine the most prominent topics in it

The topic model we use in our summarization proof of concept is called Latent Dirichlet Allocation (LDA).[7] Learning topics in an LDA model is an instance of a class of problems called posterior inference. Broadly speaking, there are two approaches to solving posterior inference problems: Gibbs sampling and Variational Expectation Maximization (VEM). While both these methods are outside the scope of this report, there are many fantastic references that we recommend reading for the details.[8]

Feature-complete and well-tested open source implementations of LDA that largely conceal the details of the posterior inference from the user are widely available. We used Allen Riddell’s Python package lda,[9] which implements a Gibbs sampler and uses some C under the hood for speed. While we were writing this report, the popular and comprehensive Python library scikit-learn added a VEM LDA solver.[10] The scikit-learn implementation also includes online learning, allowing it to learn topics from a stream of data. Other popular LDA libraries include the one in the Java statistical language processing toolkit MALLET[11] and an R package.[12]
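For illustration, here is a minimal sketch of fitting a topic model with scikit-learn's LatentDirichletAllocation. The toy corpus and two topics stand in for the millions of reviews and 100 topics used in the work described below:

```python
# Fit a small LDA model and inspect its topics via their most probable words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "A scary, dark ghost story with a terrifying ending.",
    "Great read, loved it, amazing characters and writing.",
    "Terrible plot, bad pacing, very disappointing book.",
    "Arrived in good condition and at a fair price.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per review, one column per topic weight

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {k}: {top_words}")
```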

The ability to express a document as a short vector, each element of which has interpretable meaning, makes topic modeling a powerful technique in many more contexts than text summarization.

Review Summarization with LDA

With LDA in our toolbox we are ready to begin the task of summarizing Amazon product reviews. The most popular products on Amazon have thousands of reviews, each of which is at least a few sentences long. This means that, taken together, the reviews of a popular book can be longer than the book itself. Very few visitors will have the time or patience to read more than a couple of reviews, or scan more than the 10 or so presented on the main product page. Our proof of concept constructs an extractive summary of these reviews, so shoppers can save time. To do this, we will create topic models at both the corpus and the sentence level to understand the variety of opinions in the reviews.

Learning the Topics

The key advantage of topic modeling over heuristic-based approaches to summarization is that we learn, in a data-driven manner, a way to vectorize text that is tuned for the kind of documents we’re dealing with and captures some of their meaning. In this way, we sidestep the issues discussed in Fundamental Limitations of Heuristics. Moreover, the vectorized representation we learn is interpretable, in that it is simply the topic weights of the text (e.g., 50% words from the politics topic, 30% from the oil topic, and 20% from the Obama topic).

The preprocessing step is to learn the topics exhibited by our kind of documents, by analyzing a large corpus of them. Once learned, these topics will be the lens through which we view every document or sentence. The documents we use to learn them don’t need to include the precise documents we want to summarize, but they should be from a similar domain. We used a huge archive of Amazon product reviews.[14] To simplify training and interpretation, we trained separately with reviews from each product category. We used 100 topics for each model (The Number of Topics), and the training data set was never less than 1 million reviews (in some cases it was much more).

Word         Topic 1  Topic 2  Topic 3  Topic 4
action          <1       <1       <1       20
awesome         40       <1       <1       10
bad             <1       40       10       <1
book             5        5        5        5
easy            <1       <1       <1       <1
evil            <1       10       40       <1
excellent       20       <1       <1       <1
exciting        <1       <1       <1       30
good            30       <1       <1       <1
movie           <1        5       <1       30
quick           <1       <1       <1       <1
terrible        <1       40       <1       <1
terrifying      <1       <1       40       <1

Table 3.1: Sample topics showing word probabilities (in percent).

The previous table shows a schematic illustration of four topics found in book reviews, for a very small vocabulary of 13 words. This illustrates that the topics are in fact probability distributions over words: in each topic, each word has a probability (the columns add up to 100%). The topics have arbitrary IDs, but if you squint you can perhaps identify themes. For example, Topic 1 seems to contain positive words, and Topic 2 assigns high probability to negative words. Topic 3 is a horror topic (note that the word “bad” is associated with both Topic 2 and Topic 3). Topic 4 refers vaguely to excitement, perhaps in the context of a movie adaptation.

Having examined a schematic topic distribution, let’s now look at some of the real-world topics we learned when we fit an LDA model with 100 topics to 2 million Amazon book reviews. The topics are probability distributions over an enormous vocabulary, so instead we show the five words with the highest probability in a handful of the 100 topics. Again, the numbers are arbitrary topic IDs:

*1* fantasy world magic witch harry
*9* philosophy religion ideas philosophical
*15* movie read film better seen
*21* translation work poetry literature volume
*25* dark horror novel scary ghost
*35* henry mary king england historical
*37* american white black people culture
*54* life world mind human experience
*70* love great read loved amazing
*79* good fun easy quick interesting
*99* condition price new amazon used

The themes of these topics are clear. This is the lens through which the summarization algorithm will view each product.

Constructing the Summary

In order to summarize the reviews of a product, we apply the summarization algorithm to the concatenation of those reviews (the metareview):

  1. Vectorize — Given what we learned from the training corpus, determine which topics occur in each sentence of the metareview and with what weights. The computation required for this step is built into any LDA library. The topic distribution of a sentence might be something like 70% words from the horror topic, 20% from the praise topic, and 10% from the poor customer service topic.

  2. Score — We know what topics are in each sentence. We now need to assign scores on this basis. We do this by first determining the topic distribution of the entire metareview. This tells us what people talk about when they review this product, and how often. We score the individual sentences by how much they are dominated by the one topic that most dominates the metareview. Sentences that are themselves dominated by that topic are representative of a commonly expressed opinion. These are good candidates for a summary.

  3. Select — Now we select the highest-scoring sentence or sentences.

We repeat steps 2 and 3 for the next most dominant topic of the metareview, and perhaps a few more. The number of topics considered dominant and the number of sentences it is appropriate to extract for each will depend on the product and problem domain.

          Topic 1  Topic 2  Topic 3  Topic 4
Weight       30       10       50       10

Table 3.2: Topic weights in a document (as percent).

             Topic 1  Topic 2  Topic 3  Topic 4
Sentence 1      30       20       30       20
Sentence 2      50       20       10       20
Sentence 3      10       40       20       10
Sentence 4      30       10       40       20
Sentence 5      10       60       10       20

Table 3.3: Sentences that score highly in the document’s dominant topics are selected. In this case, sentences 2 and 4 would be selected for the summary.
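A minimal sketch of the score and select steps, using the schematic numbers from Tables 3.2 and 3.3. Here sentence_topics holds the per-sentence topic weights produced by the LDA library, and doc_topics the metareview's overall topic weights; the function and variable names are ours:

```python
# Select the sentences most dominated by the metareview's dominant topics.
import numpy as np

def summarize_metareview(sentences, sentence_topics, doc_topics, n_topics=2, m_sentences=1):
    dominant = np.argsort(doc_topics)[::-1][:n_topics]  # most dominant topics first
    summary = {}
    for topic in dominant.tolist():
        order = np.argsort(sentence_topics[:, topic])[::-1]
        summary[topic] = [sentences[i] for i in order[:m_sentences]]
    return summary

doc_topics = np.array([0.30, 0.10, 0.50, 0.10])       # Table 3.2
sentence_topics = np.array([                          # Table 3.3
    [0.30, 0.20, 0.30, 0.20],
    [0.50, 0.20, 0.10, 0.20],
    [0.10, 0.40, 0.20, 0.10],
    [0.30, 0.10, 0.40, 0.20],
    [0.10, 0.60, 0.10, 0.20],
])
sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4", "Sentence 5"]

print(summarize_metareview(sentences, sentence_topics, doc_topics))
# Topic 3 -> "Sentence 4", Topic 1 -> "Sentence 2", matching Table 3.3
```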

The score and select strategies used in our proof of concept are intended to be simple examples that allow us to demonstrate the approach, without getting bogged down in details unrelated to topic modeling. Nevertheless, we’ll talk about some of the ways in which they compromise the quality of the summary, and how those might be avoided, in Sentence Selection.

Visualization

Having applied the preprocessing step to a representative corpus of book reviews and learned 100 topics, the summarization algorithm next treats the reviews of a product as a single document, and determines which of those topics most dominate the reviews. It then selects sentences that are representative of those dominant topics.

How exactly to present these results is a product design question that will depend on the application, user, and data. Rather than get into this question, we demonstrate the simplest approach: we show three representative sentences for eight dominant topics in reviews of a sample product, the book The Strain by Guillermo del Toro and Chuck Hogan.

Figure 3.3: Representative topics from The Strain

The algorithm selects sentences that are both positive and negative, sentences that describe the book’s genre and tone, and sentences that describe its plot in a matter-of-fact way. Together, these 24 sentences are a useful summary of the thousands of reviews that have been written of this book.

Topic Modeling Challenges

In the remaining sections of this chapter we discuss the challenges that are particular to the topic modeling approach to summarization. Later in this report (Building a Summarization System) we compare and contrast these challenges to those of simple approaches such as Luhn’s algorithm, and the recurrent neural network approach we use in our prototype.

Training Data

The size of the corpus necessary to learn a topic model is a subject of current research, and related in a complicated way to the number of topics you fit and the lengths of the documents it contains. The more topics you want to learn, the more documents you need, and the longer the documents the better. As a rule of thumb, we had excellent results with around 10,000 example reviews per topic. We’ve seen reasonable results with as few as 200 documents per topic, however.[15] If you intend to fit 100 topics, you need at least 20,000 documents, but will get better results with 1 million.

The training documents don’t need to include the documents you want to summarize, but they should be similar. A system trained solely on books will do a poor job of summarizing reviews of pet supplies, because it will be looking for topics related to character, story, and writing style, rather than smell, build quality, and breeds. That said, a system trained on pet supply reviews and book reviews, if there are enough examples of both and the system is allowed to use enough topics, should do a good job on products from either category.

A particularly useful attribute of topic modeling is that it is unsupervised. In the context of our summarization task, a supervised method would need both original documents and hand-written summaries. But topic modeling doesn’t need the hand-written summaries. It finds topics without them.

The Number of Topics

Topic modeling is unsupervised, but it does need one crucial piece of information to work: the number of topics you want to model in the data.

In an ideal world, the number of topics you tell the algorithm to look for would be driven solely by a vague but intrinsic property of the corpus: the number of topics that are actually present in the documents. The meaning of “topic” depends on the context. For example, if you are analyzing legal documents, then there will be many distinct law-related topics (tort, criminal, international, etc.). But if you’re summarizing news articles, then you might want to merge these into a single law topic. We’ve seen people use as few as half a dozen topics, and as many as 1,000.[16]

But in the real world, the number of topics you can fit is constrained not only by the intrinsic properties of the corpus, but also by data, product and engineering limits.

First, how many documents do you have? While you might feel justified in assuming your documents contain 1,000 topics, you won’t get good results if you only have 10 example documents. There isn’t enough information in those documents, and the topics will be meaningless.

Second, how many topics can the end user of the summary deal with? This depends on the use case, and is less of an issue with summarization, where by definition we are only interested in the dominant topics. But if you intend for the user to see all the topics — which might be the case when topic modeling is applied to corpus exploration, for example — any more than a few topics will be overwhelming.

Finally, what computing resources are available? The more topics you use, the more expensive it will be to train your model, and the more expensive it will be to use this trained model to determine the topic distribution of a new document.

Computation

Training a topic model with LDA takes minutes or hours, rather than the days or weeks required by a more complex algorithm. With one million reviews, 100 topics, and Allen Riddell’s Python package lda, we trained on a single core of a commodity PC in a few hours. If you have a much larger dataset then we recommend that you switch to an online or distributed solution, since the data likely cannot be held in memory on a single machine.[17] Online LDA models also allow the topics to evolve as new data arrives, which can be useful if the topics in your corpus evolve with time.

Once the topics have been learned, the summarization step itself is much quicker. For a well-reviewed product with thousands of reviews, our naive and unoptimized approach took around a minute to determine the topic distribution of every sentence in every review. If a one minute delay is not acceptable in a particular product, the summaries (or the per-sentence topic distributions on which they depend) can be precomputed for each product in a batch process, and then recomputed when a new review is written.

Sentence Selection

Our proof of concept has an extremely simple sentence selection strategy: for each of the N topics that dominate the metareview, we select the M sentences most dominated by that topic. This approach has a couple of problems.

The sentences selected for a particular topic can lack diversity (e.g., we saw lots of sentences saying “I recommend it,” give or take a stopword). There are many possible solutions to this problem. You could, for example, penalize short sentences, because long sentences are less likely to be similar. Or you could select sentences greedily, and incorporate a penalty for similarity to the previously selected sentences.
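Here is a sketch of the greedy approach, in the spirit of maximal marginal relevance: each new sentence is chosen to maximize its score minus a penalty for similarity to the sentences already selected. The similarity function and penalty weight are left as assumptions:

```python
# Greedy selection with a redundancy penalty.
def greedy_select(sentences, scores, similarity, k=3, penalty=0.5):
    chosen, candidates = [], list(range(len(sentences)))
    while candidates and len(chosen) < k:
        def adjusted(i):
            # Penalize similarity to the most similar already-chosen sentence
            redundancy = max((similarity(i, j) for j in chosen), default=0.0)
            return scores[i] - penalty * redundancy
        best = max(candidates, key=adjusted)
        chosen.append(best)
        candidates.remove(best)
    return [sentences[i] for i in chosen]
```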

The number of dominant topics to find in each metareview, N, and the number of sentences to select for each dominant topic, M, are fixed constants that we chose by hand, which prevents the user from exploring the reviews. This problem can be solved by user interface changes that implicitly allow the reader to choose N and M. For example, you could present a list of the top few topics, and allow the reader to select the topics from that list. And for each topic, you could allow the reader to request a few more representative sentences.

The Meaning of Topics

Unsupervised machine learning systems identify structures in data, but on their own they have no way to name or describe the structures. In the context of topic modeling, this means that our topics don’t come with human-readable labels (see, for example, Table 3.1). Throughout this chapter we’ve used terms like “the politics topic.” This makes it easier to explain topic modeling, but obscures this important limitation.

One approach to “labeling” a topic is to use a few words with high probabilities. This is the approach we take in our proof-of-concept. For example, if the most probable few words of a topic are “dark,” “disturbing,” “evil,” “ghost,” and “scary,” that indicates to most users a coherent topic. If the model is precomputed and unlikely to change, and the number of topics is small, you could label topics by hand by examining these top few words. With a hundred or so topics this may be tractable, although there can be overlap and redundancy in the topics (e.g., more than one topic that assigns high probability to words related to horror).

4. Recurrent Neural Networks - Background

Topic modeling, discussed in the previous chapter, is a fast, relatively simple, and highly interpretable approach to text vectorization and summarization. But using more recent neural network language modeling techniques, we got even more coherent and useful summaries of a single document. The neural network approach to summarization is the subject of the next three chapters, which cover the conceptual background, our summarization prototype, and the challenges of working with neural networks.

To complete the first step of our vectorize-score-select framework (A General Framework), we could use traditional NLP methods to tag the entities and parts of speech in a sentence, but traditional NLP’s principal aim is to label a sentence in a way that is meaningful to humans. By contrast, our aim is to represent text in a way that is optimal for an algorithm, where we define “optimal” in relation to the task the algorithm performs. In our case, we want text representations that enable an algorithm to do a good job of extracting sentences to build a document summary.

Why Text Is Hard

In our previous report, Deep Learning: Image Analysis, we used convolutional neural networks to identify the objects in images. The success of neural networks in computer vision problems hints at their tremendous power to create machine-friendly representations of images that capture their meaning. But it turns out that it is harder to represent the meaning of text than of images. Simple labels often work for images (“cat,” “dog,” “sunset,” etc.), but text has unique subtleties which mean that tiny changes can have a huge effect on its meaning. Examples of this include the sentence “The panda eats, shoots, and leaves” versus “The panda eats shoots and leaves,” and the Twitter bot Kenosha Kid.

Images can be resized or cropped without changing meaning, which means that an algorithm that analyzes images can count on a fixed-size input. But text (and other variable length sequences like stock prices or the frames of a video) cannot be arbitrarily truncated or shrunk. The analysis and representation of these types of sequences has always been a great challenge in the field of neural networks, since a classical feed-forward network can only accept a fixed-size input and has no internal state to inform current classifications based on previously seen data.

Feed-forward neural networks have no internal state, making them a bad fit for sequential data of variable length

You might think that we could use feed-forward networks to classify videos, for example, since videos are just sequences of individual images. Naively, there are two main ways to attempt this:

  1. Classify each individual frame and somehow combine all of the per-frame classifications into a view of the entire video.

  2. Create a new image classification model that takes in a fixed number of frames and combines them into a classification.

The first solution has the problem that each frame is treated as if it were completely independent of the others. For example, if a video showed how to make a sandwich, each frame would be classified with what it showed (“bread,” “cheese,” “lettuce”), but all of these facts would not be taken together to make a holistic claim about the video. The second solution attempts to solve this problem, but the fixed number of frames doesn’t scale as the length of the videos changes. Furthermore, it cannot deal with different timescales in various videos (for example, in a cooking show ingredients could be shown for a couple of seconds each, but in an infomercial the product is shown for the majority of the video).

Recurrent neural networks (RNNs) solve this problem in a more fundamental way: they maintain an internal state between inputs. The RNN can therefore remember context and use this to inform new decisions. This allows it to accept arbitrarily sized input and generate an arbitrarily sized output (for example, if we want to generate a sentence as the output of our model, we may not know in advance how long this sentence will be).

A simple recurrent neural network maintains an internal state by feeding the last output in with the new input

With a recurrent neural network, we go back to feeding the model one frame at a time, but the model’s internal state allows it to remember the previous frames and use that information in the future. This internal state can be designed in such a way that the network is able to remember aspects of data seen long ago — much further back than we could enable by sending sets of frames at once.

Overview of RNNs

In the early days of RNNs, an internal state was achieved by simply feeding in the previous output of the network as a second input. This functionality is similar to a second-order Markov chain: the input is understood using the previous output.

A simplified, rolled up, version of the RNN diagram above, to be compared to the more complex networks explained below

More complicated structures for RNNs have existed for a long time, but until recently we lacked an algorithm to train them effectively. Just as feed-forward neural networks required backpropagation to realize their potential, RNNs required an algorithm called backpropagation through time (BPTT).[18] However, a complication first discovered in 1991 called the vanishing gradient problem made it hard to apply BPTT to recurrent networks. This problem slowed training to a standstill before the model could reach a reasonable solution.[19] However, the algorithmic breakthrough of BPTT was not enough: we also needed to find specific model structures to overcome this problem.

Model Structures

One of the more interesting structures that can be effectively trained with BPTT is the Long Short-Term Memory (LSTM) architecture.[20] As the name implies, this model not only takes in the previous output but also contains an internal state that essentially allows it to have a long-term memory, useful for seeing long-term dependencies — when understanding language, for example, sometimes the previous word is important, but other times the most important indicator of the next word happened many words earlier. It is important to note that this memory is not simply a copy of the previous input; like most neural methods, LSTM learns to create a low-dimensional representation of the data that compactly stores needed features.

An LSTM uses two internal states

Once short- and long-term dependencies were introduced into RNNs, their use became widespread and research in this field skyrocketed. Simpler mechanisms that avoid the vanishing gradient problem, such as the gated recurrent unit (GRU), were designed to gain the modeling strength of the LSTM model without so much computational complexity. (The vanishing gradient problem, you might say, was eaten by a GRU.)

The GRU can be as effective as the LSTM model with less complexity

These new model structures have been used for a wide range of applications. One that we will be particularly interested in is a layered use of LSTMs called the encoder-decoder model (covered in more detail in Where Do Embeddings Come From?).

Algorithmic Breakthroughs

In addition to different RNN structures, techniques have also been developed that can be added to existing structures to make them more flexible. For example, attention models learn to use only relevant parts of the RNN’s internal state for a particular input.[21] Neural attention models were first created for non-recurrent networks, but their use in recurrent networks has made them an essential tool in the field. While the details of this method are out of scope for this report, generally these attention schemes allow the model to select the parts of its memory that will be useful when trying to process a new piece of data. Like the regularization techniques discussed in Deep Learning: Image Analysis (page 27), this method allows the model’s memory to be much more robust, since different parts of the memory can be specialized for different uses.[22]

The role of attention models highlights one of the key differences between RNNs and classical feed-forward networks. The primary strength of feed-forward networks is their ability to build and select features from their input. In fact, many feed-forward models can be thought of as machine-learned feature extractors with logistic regression as a final classification step. With RNNs, on the other hand, we have to shape how the model extracts features from its inputs in addition to how it will remember and interpret these features given more information.

Furthermore, it is important to realize that recurrent neural networks are still very new. Just as attention models were first designed for normal feed-forward networks and were later adapted to recurrent networks, more and more techniques that are already standard for feed-forward neural networks are now being adapted for RNNs. For example, the dropout algorithm, which was important in reducing training time and the chance of overfitting for feed-forward neural networks, is only now getting its theoretical underpinnings for recurrent neural networks.[23] This all points to the field of recurrent neural networks benefiting from the same research leaps and bounds that the feed-forward and convolutional neural network world saw several years ago.

Language Models with RNNs

A particularly exciting aspect of RNNs is their ability to create smart embeddings using unsupervised training. This is an example of the kind of vectorization we discussed in Vectorize, where we used the bag-of-words approach. A function that reduces the size of a piece of data while still maintaining some relevant information can be considered an embedding. A smart embedding, or language model in the case of text, goes further: the vectorized representation contains more easily accessible information than the raw data.

For example, with classical feed-forward neural models, we can encode images in a way that represents not just the raw pixel values but also semantic information.[24] This smart embedding of images is what powers Google’s similar image search: given an image, it finds the semantic hash and then finds more images with a similar hash. In essence, we have taken a large piece of data — an image — and reduced it to a smaller representation that contains information about the original image’s content and style (things that were not easily accessible from the raw image data itself).

These sorts of embeddings are incredibly useful for many reasons. Of particular note is their ability to compactly contain salient information about the data they represent so that a wide variety of alternative tasks can be accomplished with minimal computation. For example, a good image embedding probably has the salient information to do object labeling, similar image scoring, and style extraction with equal ease.

Recurrent neural networks allow us to create the same type of embeddings with words and other forms of sequential data. This generally happens using a combination of two methods: skip-gram training, which helps us learn how to contextualize data, and encoder-decoder models. With these generalized language-level embeddings we can perform all sorts of analyses on text that depend on its semantic meaning. In this sense, the embedding forms a language model similar to those of classical NLP. But instead of explicitly tagging parts of speech and extracting entities, our model automatically encodes that information in a machine-friendly way.

Where Do Embeddings Come From?

Encoder-decoder models are more easily understood by looking at one of their most important applications: language translation. At a high level, we can think of an English to Spanish translation encoder-decoder as a model that takes in English text, creates an abstract representation of it (the encoder step), and then transforms this representation into Spanish (the decoder step).

An encoder-decoder model for translation first creates an abstract representation of the text, then transforms that abstract representation into another language

An interesting aspect of this is that both the input and the output of the model can be the same type: language.[25] This can be exploited to train a model where the interesting feature isn’t the final output of the model but rather the underlying abstract representation that the model creates. For example, we can create a model that learns to predict the context of a particular piece of data, not with the goal of actually using this prediction, but rather under the assumption that the intermediate representation the model learns for the data is itself interesting and useful for other tasks.

Skip-grams are a way to do exactly that: train a network on sequential data in order to learn contextual information about a particular piece of the sequence. With this scheme, we train a model on a large corpus of text to predict the words that are likely to surround any given word. Once the model is trained and able to do this prediction accurately, we can use its internal state when given a word as the embedding of that word. This embedding should have all of the contextual and semantic information in it, effectively giving us a model of language without needing any data except example text.

Skip-grams train a network to predict surrounding words

The first use of skip-grams with encoder/decoders to create language models happened at the level of individual words in a model called word2vec.[26] While this method was capable of giving a good clustering of the words (i.e., grouping semantically related words together), it also proved adept at a much more fundamental problem — solving word analogies. That is to say, not only is it able to find synonyms for words, but by doing the mathematical operations for “King - Man + Woman” it finds the word “Queen”. This indicates an understanding of language that goes beyond the meanings of individual words to the relationships among them.

Word2vec encodes words as vectors, allowing word analogies to be solved through arithmetic
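The analogy property is easy to reproduce with a pretrained word2vec model, for example via the gensim library (this sketch assumes you are willing to download the roughly 1.6 GB Google News vectors):

```python
# Solve "king - man + woman" with pretrained word2vec vectors.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to return "queen" as the nearest word.
```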

Since this capability was understood, there have been many attempts at extending it beyond the word level. The main focus has been on sentence-level embeddings, but there has also been work on hierarchical models that use word, sentence, and paragraph embeddings at the same time. This problem is harder than individual word embeddings because, while a word has a small number of meanings that can be listed in a dictionary, sentences have broader and more subtle semantics.

One model for sentence-level embeddings that requires particular mention is skip-thoughts.[27] This method uses skip-grams to predict the sentences that come before and after a given sentence. It was trained on the BookCorpus dataset,[28] but the resulting embedding is incredibly versatile. In the same way that word2vec embeddings are able to solve analogies (see Figure 4.8) without being specifically crafted to do so, skip-thoughts embeddings have been able to solve a wide variety of tasks such as paraphrase detection, semantic relatedness, image-sentence pair ranking, and question classification. Moreover, it can do this at near state-of-the-art accuracies with a simple logistic regression model built using the embedding as input.

The skip-thoughts model works similarly to skip-grams but acts on the sentence level

Other Uses

While we have focused on exploring the uses of RNNs to create robust language models, the general ability to learn representations of sequential data has opened up completely new fields to neural learning methods. RNNs have become an invaluable tool in domains ranging from machine translation to bioinformatics.

Neural networks can be assembled to work in a wide range of domains

Because of the simple composability of neural networks, solutions in one field easily become aids for other fields. That is to say, when a new image analysis model is released, any field that can take advantage of images in some way can use this new model. Sometimes the methods can even be carried over to different types of input (as was the case with attention models, which were first applied to images, then migrated to text and even to neural decision making). The various neural network models can be seen as building blocks that can be easily fitted together to solve harder problems than any of them could alone.

5. Prototype - Brief

When approaching the problem of text summarization using recurrent neural networks (RNNs), we followed the vectorize-score-select framework for summarization algorithms described in A General Framework. This provides a good point of comparison between new and old methods, as well as highlighting the modular nature of neural methods.

We used recent innovations in neural language models to perform the vectorization step (see Language Models with RNNs). For the scoring step we experimented with various approaches, from logistic regression to deep recurrent neural networks. The differences between these methods and how they were compared are explored in Scoring Model.

Dataset

Neural networks are supervised models, so in order to train one to do summarization we first needed a corpus of documents and corresponding hand-written summaries from which the system could learn. Because we are working with extractive summaries, we needed a dataset that maps sentences from some input text corpus to scores that designate how valuable each sentence would be in a summary.

Finding this dataset was a challenge. There are many public datasets (e.g., NIST’s DUC dataset[29]) but they tend to be either from a narrow domain or small. Instead, we decided to scrape The Browser[30] for its articles and summaries. We picked this source for a few reasons:

  1. Summaries from the main site often include direct quotes from the original text.

  2. The articles are very editorial in nature, making them better suited to neural methods than classic heuristic methods because their semantic content is more important than heuristics such as the presence of nouns.

  3. Articles linked to from The Browser are from reputable websites that many of us at Fast Forward Labs already read.

The first point is the most important: we can use the direct quotes from the provided summaries to score sentences from the target articles in terms of how important they are.[31] This tells us how likely it is that any given sentence will be incorporated in an extractive summary.

The Browser collects articles of note and provides a short summary of each

We collected 18,000 of the handwritten summaries provided on The Browser’s main site, and stored them alongside the corresponding full article bodies.

Language Model

With a dataset in hand, we were ready to pick a language model to implement the vectorization step of the extractive summarization framework (see Vectorize for an explanation of this phase).

As described in Language Models with RNNs, recurrent neural networks can create powerful language models that accurately capture text semantics. Further, we needed a language model that can understand the relatedness of two sentences.[32] This ability would give our model more power when it comes to articles that compare multiple viewpoints or ones that build up an argument about a specific topic. Lastly, we needed a model with a large vocabulary. Language models have less predictive power with text that frequently uses words outside their vocabulary. Our dataset comes from a wide range of sources and our system hopes to be useful on arbitrary articles, so we knew that a small vocabulary would substantially limit the final model’s utility.

With all of these factors in mind, we settled on skip-thoughts.[33] The main benefits of this model are:

  1. It is trained on a large dataset of novels which encode very deep semantics compared to the normal Wikipedia training set.

  2. It has a large vocabulary from its training data and a smart vocabulary expansion mechanism.

  3. It performs well on entailment tasks.[34]

  4. It is open source (released under the Apache 2.0 license).

  5. The trained model is provided.

This model provides a transformation that takes in an English sentence and turns it into a 4,800-element embedding vector. This vector can then be used as the input for any further analysis that needs to operate on the semantics contained within that sentence. This further analysis can even be as simple as taking the cosine distance between two sentence representations in order to see how similar they are in meaning.
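
As a small illustration of that last point, the sketch below compares two sentences by cosine distance over their embeddings. The `encode` argument stands in for the released skip-thoughts encoder (or any other sentence embedding model); the random stand-in encoder exists only so the sketch runs without downloading the trained model.

```python
import numpy as np
from scipy.spatial.distance import cosine

def sentence_similarity(sent_a, sent_b, encode):
    """Return a similarity in [0, 1] between two sentences.

    `encode` is any function mapping a list of sentences to an array of
    embedding vectors (e.g., 4,800-dimensional skip-thought vectors).
    """
    vec_a, vec_b = encode([sent_a, sent_b])
    return 1.0 - cosine(vec_a, vec_b)  # cosine() returns a distance

# Toy stand-in encoder so the sketch runs without the real model:
rng = np.random.RandomState(0)
fake_encode = lambda sents: rng.randn(len(sents), 4800)
print(sentence_similarity("The cat sat.", "A cat was sitting.", fake_encode))
```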

Scoring Model

The next phase in the extractive summary framework is to score the sentences of the article using the vectorized forms from the previous step. There is quite a lot of flexibility in this step, and this is when we are able to encode some of our assumptions about the rhetorical structure of the documents we are summarizing. We considered two main model types: recurrent and feed-forward neural models.

In this context, we can think of a feed-forward network as scoring sentences solely based on each sentence’s representation as given by the language model. On the other hand, the recurrent model uses information from other sentences to contextualize the sentence it is scoring.

Both types of models were trained in the same way. We used an 80%/10%/10% training/validation/test split on our dataset. In this context, the training data is used to change the weights of the network, the validation data is used to benchmark how well training is going, and the test data is used as a point of comparison with other models. The optimization algorithm was Adam,[35] and we trained until the error on the validation data had not improved for two to five training epochs, depending on the model (a scheme called early stopping with patience). With this setup, training typically took 8 hours on our TitanX GPU (see Training Time for a more in-depth discussion of training times).
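
This training recipe maps onto standard Keras machinery. The sketch below shows the split, the Adam optimizer, and early stopping with patience on random stand-in data; the patience value and the placeholder single-layer scorer are assumptions, and the architectures we actually compared are described in the following sections.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

# Toy stand-ins: 4,800-dimensional sentence vectors and their target scores.
X = np.random.randn(1000, 4800)
y = np.random.rand(1000)

# 80%/10%/10% train/validation/test split.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Placeholder scorer; see the scoring-model sketches below for the real candidates.
model = Sequential([Dense(1, activation="sigmoid", input_shape=(4800,))])
model.compile(optimizer="adam", loss="mse")

# Early stopping with patience: stop once validation error stops improving.
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100,
          callbacks=[EarlyStopping(monitor="val_loss", patience=3)])

print(model.evaluate(X_test, y_test))  # held-out error for comparing models
```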

Feed-Forward Scoring

It is initially tempting to try a complex model structure when working with neural models. But our experiments with various feed-forward scoring models show this isn't necessarily the best thing to do. After trying many different multilayered networks with different numbers of hidden nodes and activation functions, we found that the most performant option was simply a single-layer neural network with a sigmoid activation function (see the top left of the network sketches below). This model can be seen as simple logistic regression!

Sketches of some of the networks tried. The single-layer models (top left and right) performed best
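
A minimal sketch, assuming Keras, of the kinds of feed-forward scorers we compared. The single-unit model mirrors the winning configuration described above; the layer sizes in the deeper variant are illustrative, not the exact networks we tried.

```python
from keras.models import Sequential
from keras.layers import Dense

# The winner: one sigmoid unit on top of the 4,800-dimensional sentence vector.
# Trained with a mean-squared-error loss, this is essentially logistic regression.
simple_scorer = Sequential([
    Dense(1, activation="sigmoid", input_shape=(4800,)),
])

# One of the more complex variants, which tended to overfit the noisy data.
deeper_scorer = Sequential([
    Dense(256, activation="relu", input_shape=(4800,)),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),
])

for m in (simple_scorer, deeper_scorer):
    m.compile(optimizer="adam", loss="mse")
```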

While it is hard to say exactly why this simple model performs so much better than more complex ones, there are telltale signs that systematic noise in the training data is responsible. More complex model structures were able to learn these noise signals and overfit to them. Overfitting gives a model better error rates on data containing the same noise; however, the validation reported in Table 5.1 used a completely held-out dataset without such noise.

This systematic noise probably comes from edge cases in the scraping code used to acquire the dataset from The Browser (see Dataset). For example, author or location information that is sometimes present at the beginning of the first sentence of an article may have been improperly filtered out and served as an indicator to the model.

Recurrent Scoring

Recurrent models have the benefit of being able to take context into account when scoring a given sentence. This is because of the recurrent neural network’s internal state (which we describe in Overview of RNNs). For the case of text summarization this is particularly advantageous, since the score of a particular sentence can be determined not only in terms of the sentence’s semantic qualities but also of the rhetorical structure of the text before it. It is for this reason that we used a language model that has good properties when it comes to the task of relatedness: the relational quality of a sentence to the previous sentences is incredibly important when exposing rhetorical structure.

The model that performed the best was again a relatively small network. It had one LSTM layer (see Overview of RNNs) with 512 hidden units. In addition, it was trained with larger batch sizes and more patience for the early stopping algorithm. All of these features combined helped us avoid the problems with possible systematic error in the dataset described in the previous section.
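
A minimal sketch of such a recurrent scorer, again assuming Keras: one LSTM layer reads an article's sentence vectors in order, and a per-timestep sigmoid turns each hidden state into a sentence score. The TimeDistributed output layer and variable-length sequence handling are assumptions about how this would typically be wired, not the prototype's exact code.

```python
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# Input: one article = a sequence of 4,800-dimensional sentence vectors.
# Output: one score per sentence, informed by the sentences that came before it.
recurrent_scorer = Sequential([
    LSTM(512, return_sequences=True, input_shape=(None, 4800)),
    TimeDistributed(Dense(1, activation="sigmoid")),
])
recurrent_scorer.compile(optimizer="adam", loss="mse")
recurrent_scorer.summary()
```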

Validating Results

In Result Quality we describe the issues that make it difficult to validate generic summarization algorithms. The neural network approach suffered from similar problems, but we designed our experiments in a way that would mitigate this.

Most importantly, we reformulated the summarization task into a strongly supervised one. The available metrics for scoring the quality of a generated summary have problems (something we describe in Result Quality). Instead, we use a metric that assumes we are given a human-generated summary with extractive elements (i.e., one that quotes the original article) and use this to train our system.[36] This has the advantage that our problem becomes entirely supervised and doesn't simply replicate problems with previously available scoring mechanisms. This scoring is also robust enough to allow us to benchmark against other available methods and to use mean squared error reliably to measure the effectiveness of the various algorithms (see Table 5.1).

However, we still suffer from error in our training scores, since they are dependent on who wrote the summaries we are optimizing for. In an article there are often several sentences that share the same meaning, and would therefore be equally useful in a summary. But because we use the simple Jaccard score when creating our dataset, we give different scores to two sentences that share the same meaning but don't share the same words.

Model                        Mean squared error (× 100)
Average Top 3 Models         2.55
Single Layer LSTM            2.58
Single Layer LSTM            2.59
Single Layer Feed Forward    2.62
Single Layer LSTM            2.64
Two Layer Feed Forward       2.65
Average All Models           3.37
Single Layer LSTM            4.39
Two Layer LSTM               4.41
1 LSTM + 2 Feed Forward      7.48
Random Scoring               18.84

Table 5.1: Model errors using mean squared error.

Designing Brief

The Brief prototype

In designing our prototypes, we work to both demonstrate the technology we are investigating and create a compelling standalone product experience. As we iterate, our growing understanding of the technology informs our product vision, and our vision motivates us to push the tech. You can see the interplay of these constraints and possibilities in the design process that went into our summarization prototype, Brief.

The Problem

Part of our interest in text summarization grew out of discussions with our clients. Many of them are struggling to make efficient use of the often overwhelming amount of unstructured textual data available to them. As digital storage has become cheaper and cheaper, it's become possible to amass massive amounts of reports and internal documents, but the tools to then efficiently mine those texts for relevant information have not yet been developed.

This dilemma was familiar to us on a personal level, as the number of interesting articles we come across online far outstrips the time we have to read them. For those of us who use read-it-later services, this mismatch is visible as an always-growing queue of saved articles. As we brainstormed product possibilities, we dreamed of creating something that could help us better process all that text.

The Product

Inspired by products such as Instapaper[37] and Readability[38] that provide enhancements to the article reading experience through a browser extension, we imagined building a “summarize” button for the browser. Compared to other approaches we explored, such as a news summarization product that focused on summaries of breaking news topics, an extension offered the most freedom to users, allowing them to summarize anywhere across the internet.

That freedom had its trade-offs. Because our model was trained on a certain type of article and performs better within that domain, giving people free rein increased the chance that they would apply the summarization to an article it would not do well on. We were also setting ourselves up to have to deal with different sites' idiosyncratic HTML formatting.

In the end we felt those challenges were worth it, because this was the product we most wanted ourselves, and because it persuasively demonstrates one of the key values of machine learning: the ability to scale beyond human capabilities. Instant summaries of whatever you want whenever you want them — it sounded like the future.

Early Prototype

Through early tech experiments, we quickly reached the conclusion that our summary would be an extractive one, pulling its sentences from within the article itself. The first design priority was getting a quick and dirty idea of how those sentences would feel as a summary. We put together a bare-bones extension that listed the top-scored sentences in order.

Our original internal extension

This initial experiment revealed several of the challenges that would drive the design. First, summaries are subjective; as we experimented with different algorithmic models, it became apparent that some were better than others, but it was very difficult to describe how they were better. Second, because the sentences were extracted from the articles themselves, there were issues with a lack of context. This was particularly apparent in the case of pronoun disambiguation, as mentioned in Select. A well-selected sentence that starts with an ambiguous “she” or “he” is confusing in isolation.

These challenges meant that a summary constructed out of, for example, the five top-scoring sentences in the article was not compelling, or often even understandable, in isolation. Doing things like displaying the top scorers alongside their surrounding sentences helped with context, but it was getting messy. A rethink was in order.

Reframing

Presenting the extracted sentences as a standalone summary was not working — it was promising the user something we were not delivering. It was time to reassess; to think about what we could deliver, and reframe the product to match that.

The inspiration for the next iteration was within the data. For our summary we were only displaying the top sentences, but our algorithm returned all of the sentences from the article, each one with a score rating its relevance to the rest of the article — a ranking of each sentence by importance. What if we used all of those scores? What could we do with that?

We tried a heatmap, visualizing the highest-rated sentences at the green end of the spectrum and the least important at the red end. This demonstrated the work the algorithm was doing while preserving a sentence’s context. As a visual, it also looked interesting; people who tested it were intrigued.

Experimenting with an article heatmap

It was, however, very distracting, making it difficult to actually read the article. There was something there, but this representation of it was too extreme. Time for another iteration.

Highlights

Working to decrease the disruptiveness of the heatmap, we experimented with more subdued color schemes. This eventually led us to what could be thought of as the original, analog textual heatmap — the highlighter. Instead of showing every sentence’s interest value, we highlighted only the interesting ones. As an article reader, you’re not really interested in having the difference between the 14th- and 15th-rated sentences visualized; you’re interested in being shown where the important stuff is.

As the highlight-focused design started to take shape, we brought those top five sentences back in, this time not as a standalone summary but as a sidebar of the top five quotes. It’s always a good feeling when you bring back an early design idea and it slots right into the current iteration, like the return of the prodigal design element.

The final layout set user expectations for highlights, not a standalone summary

The data was the same, but we had reframed the expectations. Instead of promising a summary, we promised highlights. The challenges of lost context were solved by providing the full context of the article itself; a click on a sidebar highlight would take you to its location within the article. The extractive nature of the summary was now a core feature of the product, as exemplified through the in-article highlights.

Having settled on this visualization, there was still fine-tuning to be done on how many highlights to display. Because articles are not all the same length, displaying a set number of highlights results in densely highlighted short articles and sparsely highlighted long ones. We experimented with setting the number of highlights to display as a proportion of article length. This made for stabler densities across articles, but a disorienting sidebar experience — the user was never sure how many were going to show up and what the rationale behind that number was. Our solution was to again take a hybrid approach — the number of top highlights (five) would remain constant, while the secondary level of highlights, shown in a fainter yellow, had its number determined proportionally to article length.
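
To make the hybrid scheme concrete, here is a small sketch of how the highlight counts might be computed. The ratio and cap for the secondary tier are made-up parameters, not the values Brief actually uses.

```python
def highlight_counts(num_sentences, top_n=5, secondary_ratio=0.1, secondary_cap=15):
    """Return (number of top highlights, number of secondary highlights).

    The top tier is a fixed size so the sidebar is predictable; the fainter
    secondary tier scales with article length so density stays roughly stable.
    """
    top = min(top_n, num_sentences)
    secondary = min(int(num_sentences * secondary_ratio), secondary_cap)
    return top, secondary

print(highlight_counts(30))   # short article -> (5, 3)
print(highlight_counts(200))  # long article  -> (5, 15)
```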

Further Possibilities

For Brief, we decided that the highlight metaphor was the most appropriate, but there are many other options to explore with this kind of summarization data. An alternate, more aggressive view of that data is available in Brief’s skim mode, where lower-scored text is blurred to the point of unreadability.

Brief in skim mode

Blurring is an extreme step, and not one we would generally recommend for a consumer product, but in the case of this prototype it has the virtue of clearly communicating the technological capability. If highlight mode is a suggestion of what to focus on, skim mode enforces that focus. Having solved the problem of loss of context with the highlighting mode, we took a step back the other way with skim mode, to show the real possibility of shortening an article. With improved summarization algorithms, and support such as pronoun disambiguation, skim mode could become the primary interface; however, for this prototype it is best as an experimental demonstration.

Brief history view. Summarization algorithms could be used on front-page or index views to provide the best preview of the content

Another use of the summarization information is present in the history section of Brief (viewable on the Brief main page or through the menu). This page shows links to previously summarized articles, highlighting each article’s top-rated sentence. This small feature demonstrates some of the auxiliary features that can come out of sentence scoring. Being able to dependably select an information-rich sentence from any selection of text opens up possibilities to display more interesting links in a variety of situations — from the feed view within your own app to sharing outside to social media (a simple Twitter share helper could be built off our tech by combining sorting by score with a tweet length filter). A set of recommended articles could easily be expanded from a list of titles to one of intriguing quotes. It could also automate the selection of sentences to emphasize within articles through treatments such as pull quotes. Features like these show how summarization can be a useful tool in a wide range of design and product situations.
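
As a rough sketch of the Twitter share helper mentioned above: sort sentences by score and take the highest-scoring one that fits in a tweet next to the article link. The 140-character budget and quoting format are assumptions.

```python
def tweetable_quote(scored_sentences, url, limit=140):
    """Pick the highest-scoring sentence that fits in a tweet alongside the link.

    `scored_sentences` is a list of (sentence, score) pairs from the scoring model.
    """
    budget = limit - (len(url) + 4)  # room for quotes, a space, and the link
    for sentence, _ in sorted(scored_sentences, key=lambda p: p[1], reverse=True):
        if len(sentence) <= budget:
            return '"{}" {}'.format(sentence, url)
    return url  # nothing short enough; fall back to the bare link

scores = [("This sentence is important but far too long to fit comfortably inside a single tweet together with the article link itself.", 0.9),
          ("Short and sharp.", 0.7)]
print(tweetable_quote(scores, "https://example.com/article"))  # falls back to the shorter quote
```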

6. Recurrent Neural Networks - Challenges

In this chapter we discuss the practical challenges that are particular to the neural network approach to summarization. In the following chapter (Building a Summarization System) we compare and contrast these challenges to those of simpler approaches such as Luhn’s algorithm and topic modeling and touch on the philosophical implications of RNNs.

The freedom to define the model structure is part of the power of neural networks, but increases research time. Training time can range from several hours to several days, which slows the turnaround time for a particular model configuration. And the initial deployment of a neural network can be extremely complex. In the following sections we outline the structure of an engineering project using recurrent neural networks. We focus on the issues of model structure, training time and deployment.

Engineering Time

The development of a prototype such as the one described in Prototype - Brief has several phases.

The first phase is data acquisition and processing. Neural methods generally require much more data than classical machine learning methods. The amount of data required depends on the model structure and the exact problem being solved, but a general rule of thumb is to have many more samples than parameters in the model.

The next phase is model training and validation. During this phase, new model structures are created, trained, and evaluated against each other. The time this takes is highly variable, as it involves luck both in picking a favorable structure and in the training (because neural network training starts with a random initialization, luck can still play a role in model performance).

In the final phase, the model is deployed into an existing infrastructure. As we explored in Deep Learning: Image Analysis, this can be simple if servers with GPUs are already available. In this case, a neural model can be deployed in the same way as any other model.

If compatible GPUs are not already in your infrastructure, getting them supported can potentially be a time sink. In AWS you can provision a GPU machine in minutes thanks to community support,[39] but custom GPU hardware deployment rapidly becomes extremely complex.

Model Structure

Determining the model structure needed to solve a particular task can be a hard problem. This is because RNNs are complex models with many dials to tune. This results in the need to iterate many times on the fundamental structure of the model. Coupled with the long training time, this can slow down development of these systems. However, unlike with other types of models, this slowdown is very unpredictable.

Luckily, there are fantastic tools to help make modifying model structures easy and fast. Our current preference is Keras[40] for most general-purpose neural network needs. The library supports all sorts of model types and is very much plug and play. Furthermore, it provides wrappers for the familiar scikit-learn machine learning API, which highlights how modular these neural network APIs can be. Lastly, Keras supports both Theano and TensorFlow backends, making it incredibly portable, extensible, and usable on all sorts of systems.

Training Time

Training time is a challenge when working with neural networks. While the conceptual model of an RNN can be simple to grasp, computationally there are many parameters that need to be learned, which means a large training set is required. Furthermore, the training procedure is iterative and expensive (needing several dozen to several hundred iterations over the entire dataset before converging). This results in long wait times between creating an RNN and having it fully trained and ready to validate.

Moreover, training neural networks involves random initialization. This means that we can run the same training algorithm with the same model and the same data and get different results. In many cases, the models will converge on the same result, but this isn’t guaranteed. Models will generally converge better with more data; however, an RNN structure could have been chosen that simply will not solve the problem no matter how much data is thrown at it.

While exploring different model structures for the prototype (Model Structure), we found that it took about 8 hours to train on 18,000 article/score training pairs using a TitanX GPU and 32 GB of RAM. The biggest reason for this slow performance was that each training sample was itself quite large (4,800 floating-point numbers for every sentence in the source article). As a result, data had to be constantly loaded into and out of memory, creating an I/O bottleneck. Using a computer with 128 GB of RAM or more would alleviate this problem, but we were not able to test on such a system.

Our model was built on top of an off-the-shelf language model.[41] These language models are quite versatile and can be applied to many different domains of English, but if an off-the-shelf model does not fit your needs, training a comparable model from scratch takes about two weeks.

Phase              Low experience    High experience
Data processing    Week              Week
Model creation     Weeks             Days
Training           Day               Day
Validating         Week              Days
Deployment         Week              Days

Table 6.1: Estimated engineering time for neural language systems.

7. Building a Summarization System

In this chapter we compare and contrast the data, engineering, and deployment challenges presented by the three approaches to summarization described so far: bag-of-words/heuristic systems (Extractive Summarization), topic modeling (Summarization Using Topic Modeling), and recurrent neural networks (Prototype - Brief).

Training Data

Finding a good dataset to train your model is often the hardest part of creating a machine learning system. The three summarization systems discussed in this report can be placed on a scale from algorithms that require no training data at all, to algorithms that require large numbers of example documents and model summaries. But there is a trade-off: the less training data an algorithm requires, the greater the need for engineers themselves to understand the documents.

The tradeoffs in training data and domain expertise required for different summarization systems

With a bag-of-words or heuristic summarization system (Extractive Summarization), you don’t necessarily need any training data. Rather, you need domain knowledge of the types of text you will be summarizing and the aspects of the text you would like to highlight. For example, if summarizing legal data, you could speak to a lawyer and get an understanding of which keywords or sentence structures are signifiers of importance. You could then write explicit rules based on this knowledge.
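
A toy sketch of what such hand-written rules might look like; the keyword list and the scoring rule are purely illustrative, standing in for whatever a domain expert would actually specify.

```python
# A toy heuristic scorer in the spirit of the rules described above.
# The keyword list is illustrative; in practice it would come from a domain expert.
LEGAL_KEYWORDS = {"liability", "indemnify", "breach", "termination", "damages"}

def keyword_score(sentence):
    """Count how many expert-supplied keywords appear in the sentence."""
    tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
    return len(tokens & LEGAL_KEYWORDS)

def heuristic_summary(sentences, n=3):
    """Return the n sentences that the hand-written rules score highest."""
    return sorted(sentences, key=keyword_score, reverse=True)[:n]

print(heuristic_summary([
    "The parties met on Tuesday to review the schedule.",
    "Any breach of this agreement triggers liability for damages.",
    "Either party may request termination with thirty days notice.",
], n=2))
```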

With topic modeling-driven summarization (Summarization Using Topic Modeling), you need a large number of sample documents, but you don't need corresponding hand-written summaries. You just pick the number of topics (a subtle decision we discussed in The Number of Topics) and train the system. There is still some domain-specific tuning required; in order to verify that the topics the model finds are a stable description of the documents, you must understand the underlying documents. And after verifying that the topics make sense (and perhaps labeling them by hand), you must then choose how to select and present extracts — for example, whether it is better in a particular context to present a few sentences from many topics, or many sentences from one topic. Again, this requires some domain knowledge.

Finally, the data requirements of the RNN summarizer are the most onerous of all the algorithms presented in this report. Using a recurrent neural network (Prototype - Brief), you can use an off-the-shelf language model such as skip-thoughts or word2vec for the embedding/vectorization step. But you still need a large sample of training documents (as with LDA). Further, these training documents must have corresponding summaries.

Interpretability

There are two aspects to interpretability in the context of summarization. First, are the decisions comprehensible by a human? Second, does the intermediate vector representation provide extra information beyond the raw text?

The RNN approach to vectorization and summarization described in Recurrent Neural Networks - Background is the worst of both worlds. Researchers are just beginning to visualize and attempt to understand the internal workings of RNNs as they process language data.[42] But for now there is no simple answer to the question, “Why was this sentence selected?” In the end, these intermediate embeddings are only understandable in comparison with one another; any global meaning is difficult, if not impossible, to extract.

Heuristic and bag-of-words summarization approaches are transparent, to the extent that the reason for their selections is understandable, but these approaches do not provide the additional insight of topics.

Topic modeling is the most interpretable approach discussed in this report. As with heuristic and bag-of-words approaches, there is a straightforward answer to the question of why a sentence was selected. But the topic modeling approach has the additional benefit that the intermediate vector, which gives the topics in a sentence or document, reveals in a human-readable way information that is hidden in the text. This extra information is not obvious to the reader from the text alone; the topic model discerns it by comparing the text to its memory of a large corpus of training documents.

Computational and Deployment Challenges

The computational requirements of the three approaches are very different. These differences naturally affect calculation time, which has consequences for development, training time, and deployment. But they also extend to the requirements of global state, which can increase the complexity of deploying such a system by requiring synced databases or shared filesystems.

Heuristic approaches have minimal computational and memory needs. The computation can be made exceedingly simple; in fact, Luhn’s algorithm (A Simple Implementation) is so simple it can be performed in a browser. More complex heuristic methods (such as KLSum[43] or LSA[44]) are not so well suited for in-browser computation, but are still simple enough that no special considerations need to be made for the server doing the processing. For state, they generally only need a list of stopwords for the language they are summarizing.

In contrast, topic modeling with LDA requires substantial preprocessing for the model to learn topics. Once this preprocessing is done, future computations are simple but require a large shared state that gives the correspondences between words and topics. Luckily, this global state is perfectly suited to a key/value database and can easily be made distributed and fault tolerant if necessary (or held in memory if sufficient memory is available). Furthermore, if the data is held in a database you can easily update the topics using online LDA,[45] which ensures that the model remains up to date. As a result, only moderate computational capabilities are needed to perform classification quickly.
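
As a small sketch of what that shared state could look like, the snippet below fits scikit-learn's LDA on a toy corpus and persists each word's topic weights with Python's built-in shelve module, standing in for a proper key/value database. The key layout and toy corpus are assumptions.

```python
import shelve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cream made my skin break out",
        "great moisturizer for dry skin",
        "battery life on this phone is poor",
        "the phone screen cracked after a week"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Persist each word's normalized topic weights as the shared global state.
with shelve.open("word_topics.db") as store:
    for word, weights in zip(vec.get_feature_names_out(), lda.components_.T):
        store[word] = (weights / weights.sum()).tolist()

# Any summarization server can now look up a word's topic mixture cheaply.
with shelve.open("word_topics.db") as store:
    print(store["skin"])
```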

Finally, recurrent neural networks require very large amounts of preprocessing and a large static global state. The global state is the internal state of the model once we have completed the training phase, and can easily be on the order of gigabytes (in the case of our prototype, the language model is on the order of 4 GB and the summarization model on the order of 100 MB). However, this global state is static and is always needed in its entirety, which means it can simply be shared on a network filesystem with all the servers that need it. Lastly, as discussed in Engineering Time, the computations done by the RNN are best done on a GPU. This can add extra complexity to the deployment of such a model, but it makes the resulting calculations run extremely quickly.

The tradeoffs in computational cost and interpretability for different summarization systems

Result Quality

It is largely meaningless to compare the multi-document topic-driven summaries in Summarization Using Topic Modeling to the more coherent single-document summaries in Prototype - Brief. Which approach is best necessarily depends on the kind of summaries you want to generate. If you want a coherent extractive summary, then you should aspire to an RNN system. If, however, an unstructured, unconnected sampling of representative sentences is your goal, then go with topic modeling.

But the problem of validating results goes deeper. It’s not only difficult to compare two different systems: it’s hard to simply measure the quality of a summary at all. Indeed this issue is a subject of current research.[46][47][48]

Historically, the NLP community has used a score called ROUGE.[49] This metric assesses a summary by looking for words or phrases that overlap with those in a gold-standard hand-written summary of the same text. If the gold-standard summary is extractive then, according to ROUGE, the algorithm should select the same sentences. But the problem with ROUGE and similar metrics is that two sentences in the source document can convey the same information but use entirely different words to do so. This often happens in news articles, where the same information is paraphrased and repeated. Two summaries can convey similar information while having almost no overlap.
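
As a toy illustration of the problem, here is a minimal ROUGE-1-style recall computation (unigram overlap with a reference summary); real ROUGE implementations add stemming, n-gram variants, and other options. Note how a paraphrase with the same meaning scores zero.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate summary."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / float(sum(ref.values()))

reference = "profits fell sharply last quarter"
print(rouge1_recall("profits fell sharply last quarter", reference))        # 1.0
print(rouge1_recall("earnings dropped steeply in recent months", reference)) # 0.0, despite similar meaning
```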

For comparison with other models in the literature, we provide the ROUGE scores for our models and some more traditional heuristic models.[50] A well-known property of ROUGE is that it is hard to beat the score of a baseline summary consisting simply of the first several sentences of the target article; two of our models were nevertheless able to beat this baseline.

Model                      Average Recall    Average Precision    Average F-Score
Single Layer LSTM          0.437             0.443                0.439
1 LSTM + 2 Feed Forward    0.420             0.425                0.421
First 100 of Article       0.414             0.421                0.414
Single Layer LSTM          0.411             0.418                0.413
Edmund Heuristic           0.410             0.400                0.402
TextRank Heuristic         0.409             0.398                0.401
Lex Heuristic              0.405             0.396                0.397
Luhn Heuristic             0.402             0.390                0.393
Two Layer LSTM             0.388             0.406                0.397
LSA Heuristic              0.381             0.378                0.376
KL Heuristic               0.376             0.368                0.369
Random                     0.368             0.361                0.362

Table 7.1: Model ROUGE scores for the first 100 words of the summary.

Many alternative metrics have been proposed, but there is little agreement in the community as to which one is optimal. The fundamental problem is a bootstrapping one: to evaluate the language model used by a summarization algorithm, you need to have a better language model. Currently, the only way to solve this problem is to use humans as the validators, or to design the summarization validation task in a subdomain that can be quantitatively verified.

In order to deal with the lack of generalizable validation scores, we designed our RNN learning task to have a quantitative measure of success. This allowed us to compare candidate neural network structures. We were able to do this in the RNN case because that model is built on a training set with corresponding hand-written summaries (see Validating Results), which is not the case with topic modeling.

When Simple Methods Work

In some special cases, a simple heuristic or bag-of-words approach like Luhn’s algorithm is perfect. We discussed the limitations of such approaches in Fundamental Limitations of Heuristics, and overcoming these limitations was in a sense the motivation for this report. But sometimes these limitations are irrelevant. The simplest approaches may be fine if:

If you’re lucky enough that all of these conditions hold for you, you may get acceptable performance more quickly (both in terms of developer time and computer time) by avoiding the complexity of topic modeling and neural networks.

8. Summarization Products

Summarization has been an active area of research and development for years, and there are several products in the market today that use this capability. This section reviews the landscape of summarization solutions as a guide to figuring out which vendors offer useful building blocks for product development. We also review open source projects that may be useful, and explore a few products that use summarization as a feature, to highlight current and emerging applications of the technology across industries.

Commercial Summarization Products and APIs

The companies discussed in this section offer commercial-scale summarization tools as products or as APIs for use by others. They each provide general-purpose solutions as well as industry-adapted options (finance for Agolo, airlines for Lexalytics, and social media for Aylien).

Agolo

Agolo[51] offers a remarkably fast summarization implementation. Its product is currently focused primarily on finance applications, and it can accept various input sources (such as social media, articles, and even books) and produce single- and multi-document summaries. One notable feature is the system’s ability to provide incremental updates as new stories on a given topic appear.

The system’s speed scales linearly with the number of documents, which means it can process hundreds of articles on the same topic fairly efficiently (though millions are still somewhat out of reach). Agolo provides API, SaaS, and custom solutions.

When asked about what the company sees emerging in the next few years, co-founder Mohamed AlTantawy said, “our goal is to mimic the human thought process of contextualizing real-time data. For example, if you see a news item about the iPhone, a human will immediately identify that with the wealth of knowledge they know about Apple: that Tim Cook is the CEO, that the iPhone is their major product, and so on. We aim to mimic this human process algorithmically, and thus create what we call an AI analyst — a universally available AI that can write simple summaries from any unstructured data sets.”

Co-founder Sage Wohns added, “We view summarization as integral to the future of search. Beyond the 10 blue links [on the first page of a search result] lies valuable content. With screen sizes shrinking and unstructured information growing, summarization is the key to unlock useful output for everything from search queries to push notifications from email and social networks.”

Lexalytics

Lexalytics[52] offers text summarization through its Salience 6 text analytics engine, either as an on-site software package or as a service through its Semantria API. Salience performs extractive summarization using lexical chaining to identify the sentences in a document that are most important or most relevant to an entity of interest. Salience supports text summarization in a dozen languages, including English, Spanish, French, German, Japanese, and Korean. Lexalytics CEO Jeff Catlin has noted that the company is particularly interested in the mobile applications of summarization.

Salience includes a graph representation of relationships among topics derived from Wikipedia data. This representation allows the engine to be sensitive to the semantic context of input text. Lexalytics maintains that its Wikipedia-trained model is sufficient for most purposes with minor tuning, but it also offers client-specific model training tailored to particular applications if necessary.

Based in Boston, Lexalytics has been working in text analytics since 2003. It is known for sentiment analysis, and its Salience product is used by DataSift, among others. In 2014, Lexalytics acquired Semantria, enabling it to offer Salience as an API (and as an Excel plug-in, though text summarization does not seem to be supported in the Excel version). They have since begun releasing products targeting particular industries.

Aylien

Based in Dublin, Ireland, Aylien[53] was founded in 2011 by Parsa Ghaffari with a focus on news and media industries. Aylien provides a fully featured text analytics API that includes extractive summarization. Aylien claims that its analytics API is applicable across all general use cases, but optimized somewhat for news and social media content.

We tested the Aylien API and found that the supplied Python library made it very easy to integrate the functionality into our test code. Implementing a simple summary generator from a website URL was completed in the Aylien sandbox within a few minutes.[54]

In addition to summarization, Aylien’s API also includes content, entity, and concept extraction; text classification, including taxonomic classification; image classification (though this is not directly text-related); sentiment analysis; and related phrase and hashtag suggestion. Aylien also offers an unlimited on-site solution for customers who need more or faster API access.

Open Source Summarization Tools

The following projects are open source text summarization packages. Most of these projects implement simple heuristic approaches. These methods do not generally give as good results as the more advanced methods, but as discussed in Building a Summarization System, they are sometimes sufficient. At a minimum, they may be useful as example implementations or testing benchmarks.

Summarization Applied in Products

The following companies are using text summarization as part of their own products. All of these offerings appear to be either early-stage projects or older technology; they are included here to provide points of reference and potential options for building your own products.

The majority of products surveyed here focus on the media industry and consumers' reading experience, either on the web (via a browser extension) or on mobile. While media is a natural initial focus, we see future applications in business operations, understanding customer behavior, personalization and recommendation systems, healthcare, and even the industrial internet or internet of sensors.

Conclusion

The commercial field of summarization is not crowded. A handful of legitimate commercial options exist, and a few new products are in the works. While there are no open source solutions that we recommend using exclusively, a number of open source projects offer good baseline implementations for testing. The introduction of deep learning techniques to the summarization problem has recently renewed interest in the space, and we expect to see progress over the next year in both the development of open source solutions and startups using these capabilities to build products.

9. Ethics

The goal of a summarization algorithm is to reduce the length of a document while retaining as much meaning as possible. But there’s always something lost, be that a minor (but potentially important) viewpoint, the beauty that arises from poetic form, or the rhetorical flourish that drives emotions in advertising or politics. The majority of the ethical issues related to text summarization result from the fact that we lose something when the data is reduced.

This section presents an overview of the primary ethical issues to consider when using text summarization: authorship and copyright, suppression of information and viewpoints, and interpretability.

In our first report, Natural Language Generation, we considered questions of authorship that arise when we use algorithms to generate text. When an algorithm uses fragments from an existing corpus to generate new text, should credit be given to the original author or the algorithm creating the new text? Does transposition to a new context count as an act of original authorship?

The same questions arise with extractive summarization, where algorithms select text verbatim from the original rather than reformulating the underlying ideas in new words. While the ethical questions on authorship are complex, it’s useful to formalize them as legal issues of copyright.

Indeed, international laws for text copyright and algorithmic reuse are currently changing in response to the adoption and release of new tools. In 2014, for example, Spain passed a new intellectual property law that tipped the balance strongly in favor of the original author: news aggregators like Google were charged for showing snippets and linking to news stories, eventually causing Google News to shut down operations in Spain.[55]

Freedom of speech protections under the First Amendment will likely thwart such regulations in the United States. In its 2013 “Statement on Text and Data Mining,”[56] the International Federation of Library Associations (IFLA) argued that researchers need the right to freely share the results of “text and data mining” to promote new forms of research “as long as these results are not substitutable for the original copyright work.” Still, the group distinguishes between using analytics for research or commercial purposes, suggesting that commercial reuse may result in more complicated legal issues than found in the research community.

In short, organizations with global operations and/or organizations intending to reuse extractive summaries for commercial purposes should carefully review applicable copyright and intellectual property laws as part of project preparation. Organizations using the tools for internal use only (e.g., knowledge management) will face less risk and regulatory scrutiny.

Suppression of Information and Viewpoints

The implicit goal of many approaches to summarization is to extract sentences that best match the overall meaning of a document. For documents with a single viewpoint, this may be fine, as what’s lost may be filler information or supporting arguments. But the algorithms might overlook less popular viewpoints in documents or corpuses with greater topical variety. In our Amazon review summarizer prototype (Summarization Using Topic Modeling), we found that our algorithm often identified uncommon viewpoints (for example, a rare adverse reaction to a beauty product), but there are no guarantees this will always be the case.

The output of a summarization algorithm will reflect that algorithm’s inherent biases. This is an issue with any method of condensing language. Product designers should bear this issue in mind while applying algorithms.

Interpretability

Any time we use a neural model, we need to be mindful of interpretability. With deep learning, feature engineering is left to the algorithm. In the case of text summarization, we can never be certain why our RNN ultimately decided to select one sentence as a better representative of the ideas in a document or corpus than another. Say we were to use these algorithms to analyze emails to satisfy financial fraud compliance requirements by extracting sentences that may signal corporate fraud. Subject to regulatory scrutiny, it would be practically impossible to explain just why the algorithms failed to identify salient information. Similar arguments hold for any situation where thorough methodological transparency is a legal, business, or product requirement.

Finally, uninterpretable systems are particularly vulnerable to propagating biases in original source data. We may inadvertently select certain viewpoints at the expense of others. This may be compounded over multiple iterations. There is no clear solution to this problem. System designers should consider this when choosing the algorithm and presenting its output to the user.

What You Can Do

There are a few questions you can ask to address these ethical issues:

10. Future

Extractive summarization uses sentences (or other parts) of a document to build up a summary of the document. An abstractive summary, on the other hand, is one that contains sentences, passages, or even words that do not appear anywhere in the document being summarized. A hand-written abstractive summary is generally better than an extractive summary. This is why the headline of a news article is not usually a sentence from the article itself. But automating abstractive summarization is naturally much more difficult than automating extractive summarization. It requires a system that can generate language.

While this capability has not been reached, many groups are making progress. For example, the NAMAS project from Facebook AI Research is a neural model that takes in the first 100 words of an article and generates a headline for it (as a first step toward a fuller summary).[57] The results are very encouraging and show definite promise that abstractive summarization is possible. The model is a classic neural network language model with a fixed input size, but the researchers expect improvements with a recurrent language model.[58] Using such a model would also remove the 100-word restriction, giving the system more contextual information with which to generate the headlines.

Other groups are making progress by noting that translation, simplification, and abstractive summarization are fundamentally very similar tasks, and therefore the huge recent advances in translation may be more widely applicable. New datasets such as the Newsela corpus[59] are aiding these efforts. When computer language models become more robust, we can imagine translation software that can translate not only from one language to another but also into different colloquial and rhetorical styles and different levels of detail and sophistication.

Summarization Sci-Fi: Mars Terraform Expansion S-217

A speculative sci-fi story that imagines how summarization technology might be used in the future.

Fatima wakes to a simulated bedroom sunrise and the smell of coffee.

She stretches into her robe as the bed retracts behind her and walks over to inspect her houseplants. She likes the challenge of the heirloom varieties, even if she is on her fourth basil plant this cycle.

She walks over to the kitchen and takes her coffee from the dispenser. After a moment of consideration she instructs it, “Breakfast: yogurt and oatmeal.” Taking her first sip of coffee, she approaches the transparent wall screen that subdivides her studio apartment, swiping up to clear the ambient generative artwork and bring up her morning news brief.

The top item reads “Vote on Mars Terraform Expansion S-217 Today,” followed by headlines on the safety concerns about a new synth-meat hybrid and a fashion piece on the new biomod trend of decorative “dragon scale” skin grafts. An animated reminder to do her morning calisthenics bounces in the lower-right corner.

Fatima blinks at the terraforming headline, opening a one-paragraph summary of the basic issues surrounded by a shifting network graph of opinion writing and video on the issue. She gives an upward nod, dismissing the summary and zooming out to get a feel of the graph. Grouped by similarity of word use and sentiment, the pieces are separated into two main clusters, the pro-terraforming expansionists on the left and eco-conservative anti-expansionists on the right. Smaller clusters appear on the extreme side of each group, tugging them apart, while the moderates hold — for now — at the center. Influential opinions appear larger, weighted by shares and recommendations across the social graph. The network web pulses as gravitational forces are adjusted to reflect shifts in public opinion.

The terraforming issue network influence graph

Fatima eyes the filter mechanism and directs it to show only sources she’s read and recommended before. The display fades out the majority of the nodes, leaving only her trusted sources. She blinks at the display mode toggle and a personalized summary of these sources appears, breaking down the agreements and disagreements between each and putting them in the context of pieces she has previously viewed. The summary indicates that one article, by a source she has found especially influential in the past, has been extremely controversial. She sends that piece to her e-paper. She’ll read it in full over breakfast.

A user-curated custom summary on the terraforming controversy

Feeling like she has a reasonable handle on the issue, she zooms back out to the full graph, and switches to the trending view. It’s dominated by two impassioned pleas, on the pro-expansion side from a Mars colonist famous for his heroic rescue of a fellow citizen a couple of years back, and on the con side from the authoritative-yet-affable scientist host of a popular space-experience program. The popularity of those pieces is not a surprise, but there is a less familiar presence — an eco-art activist group has created experience-art simulating a haunted post-development Mars, using historical imagery from Earth’s eco-crisis. Intrigued, she saves the experience for after breakfast, swipes down to return the screen to its resting animation, and sits down to read and eat.

11. Conclusion

Deep learning, LDA, and the entire set of algorithms presented in this report are emerging tools for gaining insight into unstructured text data. We are just beginning to understand how these techniques will impact our capabilities, and the results are already impressive.

In this report we’ve explored several algorithms for making text computable, with summarization as the immediate application. We’ve shown how to extract highly informative sentences from a single document, as well as how to cluster multiple documents and extract language that represents the entire cluster. This allows us to ask product questions, from “What are the key points in this single article?” to “What are the various points-of-view in this large set of documents, and what’s the gist of each one?”

In our proof of concept (Review Summarization with LDA), we demonstrate multi-document summarization on a set of millions of product reviews. This is useful, as it gives us the ability to understand the meaning of thousands of documents by reading only a few sentences, and it’s clear that this approach is immediately useful for other types of data, such as legal documents, customer biographies and information, product information for e-commerce, and medical data.

In our prototype, Brief (Prototype - Brief), we provide a fully functional extractive summarization product. Brief functions well on news articles and other web data, but the approach is transferable, and the prototype further demonstrates the engineering constraints and requirements of building a summarization product.

We are optimistic about future uses of the ability to make text computable. This capability will unlock new types of insight into our language, and enable products that understand human communication in new and fascinating ways.


  1. Stanford’s CoreNLP and Python’s NLTK library are two industry standards. ↩︎

  2. This problem is called pronoun resolution; while simple for humans, it is incredibly hard for machines. ↩︎

  3. Luhn, Hans Peter. "The Automatic Creation of Literature Abstracts.” IBM Journal of Research and Development 2.2 (1958): 159-165. Available at the time of writing at http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.pdf. ↩︎

  4. Stopwords are the most common words in a language. These are usually prepositions, pronouns, articles, and conjunctions that mean very little in isolation (e.g., “in,” “it,” “the,” and “and”), and they are typically ignored in NLP. ↩︎

  5. This list is by no means exhaustive; you can probably think of more attributes that are indicative. Section 2 of Ferreira et al. (2013) contains a great list of ingenious domain-specific heuristics that have been tried over the years. (Ferreira, Rafael, et al. “Assessing Sentence Scoring Techniques for Extractive Text Summarization.” Expert Systems with Applications 40.14 (2013): 5755-5764.) ↩︎

  6. Note that it is not generally a good idea to summarize the documents separately and then concatenate the summaries. ↩︎

  7. The original LDA paper is Blei, Ng, and Jordan (2003) in JMLR. If you are interested in a more conceptual introduction, however, Blei’s 2012 review in Communications of the ACM is a better place to start. Or, for a more comprehensive mathematical tutorial on both LDA and the necessary background (conjugate priors, multinomial distributions, Gibbs sampling, etc.), see “Parameter Estimation for Text Analysis” by Gregor Heinrich (2006). ↩︎

  8. Heinrich’s 2006 report (see previous footnote) has the mathematical background and some coverage of Gibbs sampling. For more detail on Gibbs sampling, see “Gibbs Sampling for the Uninitiated” by Resnik and Hardisty (2010). For an introduction to variational methods, see “The Variational Approximation for Bayesian Inference” by Tzikas, Likas, and Galatsanos (2008). ↩︎

  9. https://github.com/ariddell/lda ↩︎

  10. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html ↩︎

  11. http://mallet.cs.umass.edu ↩︎

  12. https://cran.r-project.org/web/packages/lda/ ↩︎

  13. https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/ ↩︎

  14. http://jmcauley.ucsd.edu/data/amazon/ ↩︎

  15. http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html ↩︎

  16. See, for example, Wikipedia in 500 topics and a historical analysis of the New York Times in 1,000 topics. ↩︎

  17. An online VEM approach is described in Hoffman, Blei, and Bach (2010) and implemented in Python in scikit-learn. ↩︎

  18. This algorithm came from the realization that for a given input, a recurrent neural network can be unrolled and mapped to a feed-forward neural network, where we can then use normal backpropagation techniques. ↩︎

  19. A good overview of BPTT and the vanishing gradient problem can be found at http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/. ↩︎

  20. Greff, Klaus, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. “LSTM: A search space odyssey.” arXiv preprint arXiv:1503.04069 (2015) Available at the time of writing at http://arxiv.org/abs/1503.04069 ↩︎

  21. In a December 2015 interview, Ilya Sutskever, research director at OpenAI, referred to attention models as the most exciting advancement in deep learning. ↩︎

  22. Two good resources for attention models can be found at http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/ and Mnih, Volodymyr, Nicolas Heess, and Alex Graves. “Recurrent models of visual attention.” In Advances in Neural Information Processing Systems, pp. 2204-2212. 2014. ↩︎

  23. Gal, Yarin. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” arXiv preprint arXiv:1512.05287 (2015) Available at the time of writing at http://arxiv.org/abs/1512.05287 ↩︎

  24. Salakhutdinov, Ruslan, and Geoffrey Hinton. “Semantic hashing.” RBM 500, no. 3 (2007): 500. ↩︎

  25. This doesn’t have to be true; there are interesting models that “translate” rich media into text and vice versa. ↩︎

  26. Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. “Distributed representations of words and phrases and their compositionality.” In Advances in neural information processing systems, pp. 3111-3119. 2013. with code available at https://code.google.com/p/word2vec/ ↩︎

  27. Kiros, Ryan, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. “Skip-thought vectors.” In Advances in Neural Information Processing Systems, pp. 3276-3284. 2015 with code available at https://github.com/ryankiros/skip-thoughts/ ↩︎

  28. http://www.cs.toronto.edu/~mbweb/ ↩︎

  29. http://duc.nist.gov/ ↩︎

  30. http://thebrowser.com/ ↩︎

  31. This score comes from the maximum Jaccard score between a given sentence in the article with all sentences in the summary. ↩︎

  32. In the NLP world, relatedness is a measure that takes two sentences and decides whether they agree, disagree, or are neutral. ↩︎

  33. Kiros, Ryan, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. “Skip-thought vectors.” In Advances in Neural Information Processing Systems, pp. 3276-3284. 2015 with code available at https://github.com/ryankiros/skip-thoughts/ ↩︎

  34. https://github.com/ryankiros/skip-thoughts#semantic-relatedness ↩︎

  35. Kingma, Diederik, and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014). ↩︎

  36. Explicitly, for each sentence in our target summary the score is the minimum Jaccard distance between it and all sentences and the sentences in the provided summary. ↩︎

  37. http://www.instapaper.com ↩︎

  38. http://www.readability.com ↩︎

  39. For example, the community AMI ami-1117a87a is a wonderful base image. ↩︎

  40. http://keras.io ↩︎

  41. https://github.com/ryankiros/skip-thoughts ↩︎

  42. Kádár, Á., Chrupała, G., & Alishahi, A. (2016). Representation of linguistic form and function in recurrent neural networks. arXiv preprint arXiv:1602.08952. ↩︎

  43. http://www.aclweb.org/anthology/N09-1041 ↩︎

  44. http://www.kiv.zcu.cz/~jstein/publikace/isim2004.pdf ↩︎

  45. https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf ↩︎

  46. http://www.cai.sk/ojs/index.php/cai/article/viewFile/37/24 ↩︎

  47. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings2/sum-mani.pdf ↩︎

  48. http://research.microsoft.com/en-us/people/cyl/ntcir4.pdf ↩︎

  49. https://en.wikipedia.org/wiki/ROUGE_(metric) ↩︎

  50. We used the sumy Python module, which can be found at https://github.com/miso-belica/sumy/. ↩︎

  51. http://www.agolo.com/ ↩︎

  52. https://www.lexalytics.com/ ↩︎

  53. http://aylien.com/summarization/ ↩︎

  54. The test code we used for summarization with Aylien is available at https://gist.github.com/GrantCuster/3b3b76be2956d642c018. ↩︎

  55. For coverage on the law, see http://elpais.com/elpais/2015/01/12/inenglish/1421069667_083191.html. For coverage on the Google News shutdown, see http://www.wired.com/2014/12/google- news-shutdown-spain-empty-victory-publishers/. ↩︎

  56. http://www.ifla.org/publications/ifla-statement-on-text-and-data-mining-2013 ↩︎

  57. https://github.com/facebook/NAMAS ↩︎

  58. http://arxiv.org/abs/1509.00685 ↩︎

  59. https://newsela.com ↩︎