Automatic Evaluation for Abstractive Summarization

Automatic summarization is a task in Natural Language Processing (NLP) that aims to condense an input text into a shorter text that captures its most relevant parts without losing important information [1]. There are two types of summarization: extractive summarization selects and combines sentences and phrases exactly as they appear in the original document, whereas abstractive summarization generates a shorter version without necessarily reusing parts of the original text. Text summarization is highly relevant in today's age of information overload, where over 2 million blog posts are written per day and more than 1 million scientific articles are published every year.

Although summarization has a long history in NLP, the research community has reached little consensus on how to evaluate summaries. Manual evaluation is ideal but expensive, whereas designing automatic metrics that correlate well with human judgements has proved difficult [2]. The difficulty stems from the subjective nature of the task: even human judges often disagree on what constitutes an ideal summary, and multiple different summaries can be perfectly acceptable. In addition, a metric has to assess several aspects at once, namely the quality of the information content as well as the grammar and syntax of the summary.

Currently, the most widely used metric for summarization evaluation is Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [3], which relies solely on lexical overlap between the terms and phrases of a generated summary and those of one or more reference summaries. Because an abstractive summary can contain paraphrases that share few words with the reference, ROUGE does not provide an accurate estimate of its quality [4]. The goal of this project is to develop a novel metric for summarization evaluation that extends ROUGE to better accommodate abstractive summarization. In particular, we will explore incorporating word and phrase similarity measures [5] when calculating the score of a summary. We will test the new metric on abstractive summaries from the newswire and scientific literature domains, and will analyze its effectiveness against ROUGE and human judgements.
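
To make the limitation concrete, the following is a minimal sketch of ROUGE-N recall next to one possible "soft" unigram variant that gives partial credit to similar words. It is an illustration only, not the metric this project will develop: the function names, the greedy best-match scoring, and the tiny hand-made vectors in TOY_VECTORS are all assumptions made for this example. In practice the vectors would come from trained word embeddings such as those described in [5].

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate.

    Uses clipped counts, so a reference n-gram is matched at most as many
    times as it occurs in the candidate. Tokenization is plain whitespace
    splitting, which is a simplification.
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def soft_unigram_recall(candidate, reference, vectors):
    """Hypothetical soft recall: each reference word is credited with its
    highest similarity to any candidate word (1.0 for an exact match).

    Unlike ROUGE, matches are not clipped, so one candidate word may be
    matched by several reference words.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    def similarity(a, b):
        if a == b:
            return 1.0
        if a in vectors and b in vectors:
            # Negative cosine values are treated as "no similarity".
            return max(0.0, cosine(vectors[a], vectors[b]))
        return 0.0

    best = [max(similarity(r, c) for c in cand_tokens) for r in ref_tokens]
    return sum(best) / len(ref_tokens)


# Hand-made 2-D vectors for illustration only; NOT real word embeddings.
TOY_VECTORS = {
    "movie": (0.9, 0.2),
    "film": (1.0, 0.1),
    "great": (0.1, 1.0),
    "excellent": (0.2, 0.9),
}

if __name__ == "__main__":
    reference = "the movie was great"
    candidate = "the film was excellent"
    print("ROUGE-1 recall:", rouge_n_recall(candidate, reference, n=1))
    print("Soft recall:   ", soft_unigram_recall(candidate, reference, TOY_VECTORS))
```

For the paraphrased pair above, ROUGE-1 recall counts only "the" and "was" as matches and returns 0.5, while the soft variant scores higher because "film" and "excellent" lie close to "movie" and "great" in the toy vector space. Whether such a score correlates better with human judgements is exactly what would have to be tested empirically.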

[1] Nenkova, Ani, and Kathleen McKeown. "Automatic summarization." Foundations and Trends® in Information Retrieval 5.2–3 (2011): 103-233.
[2] Mani, Inderjeet. Automatic summarization. Vol. 3. John Benjamins Publishing, 2001.
[3] Lin, Chin-Yew. "ROUGE: A package for automatic evaluation of summaries." Text summarization branches out: Proceedings of the ACL-04 workshop. Vol. 8. 2004.
[4] Cohan, Arman, and Nazli Goharian. "Revisiting Summarization Evaluation for Scientific Articles." arXiv preprint arXiv:1604.00400 (2016).
[5] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.


good programming skills, interest in natural language processing


Nikola Nikolov, niniko (at)

© 2017 Institut für Neuroinformatik