How do you evaluate a text summarization tool?

Date: 2012-03-26 20:26:04

标签: language-agnostic nlp information-retrieval evaluation

I have written a system that summarizes long documents containing thousands of words. Are there any norms for how to evaluate such a system in the context of a user survey?

In short, is there a metric for evaluating how much time my tool saves humans? Currently I am thinking of using (time taken to read the original document / time taken to read the summary) as a way of measuring the time saved, but is there a better metric?

Currently, I am asking users subjective questions about the accuracy of the summaries.

7 Answers:

Answer 0 (score: 4)

I am not sure about time-based evaluation, but regarding accuracy, you may want to consult the literature under the topic of Automatic Document Summarization. The principal evaluation was the Document Understanding Conferences (DUC), until the summarization task was moved into the Text Analysis Conference (TAC) in 2008. Much of that work focuses on advanced summarization topics such as multi-document, multi-lingual, and update summaries.

The evaluation guidelines for each of these events are published online. For single-document summarization tasks, see DUC 2002-2004.

Alternatively, you can consult the ADS evaluation section on Wikipedia.

Answer 1 (score: 4)

Historically, summarization systems have typically been evaluated by comparing their output against human-generated reference summaries. In some cases the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.

These two techniques correspond to the two major categories of automatic summarization systems, extractive vs. abstractive (more details on Wikipedia).

One standard tool is ROUGE, a script (or set of scripts; I can't remember offhand) that computes n-gram overlap between an automatic summary and a reference summary. ROUGE can compute the overlap while allowing for the insertion or deletion of words between the two summaries (e.g., if a 2-word skip is allowed, 'installed pumps' would be considered a match for 'installed defective flood-control pumps').
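As a rough illustration of the word-skip matching just described, here is a hypothetical sketch (not the actual ROUGE script, which also handles stemming, clipped counts, and other options):

```python
def skip_bigrams(tokens, max_skip=2):
    """Ordered word pairs allowing up to `max_skip` words between them."""
    return {(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + 2 + max_skip, len(tokens)))}

candidate = "installed pumps".split()
reference = "installed defective flood-control pumps".split()

# The pair survives the two skipped words in the reference.
print(skip_bigrams(candidate) & skip_bigrams(reference))
# {('installed', 'pumps')}
```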

My understanding is that ROUGE's n-gram overlap scores correlate fairly well with human evaluations of summaries up to a certain level of accuracy, but that the relationship may break down as summarization quality improves. That is, beyond some quality threshold, summaries judged better by human evaluators may be scored similarly to, or worse than, summaries judged inferior. Nevertheless, ROUGE scores can be a useful first step when comparing two candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.

If you are able to bear the time/money cost, collecting human judgments is probably the best evaluation. To add a bit of rigor to that process, you might look at the scoring rubrics used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The score sheets used by those evaluators might help guide your own evaluation.

Answer 2 (score: 3)

In general:

Bleu measures precision: how many of the words (and/or n-grams) in the machine-generated summary appear in the human reference summary.

Rouge measures recall: how many of the words (and/or n-grams) in the human reference summary appear in the machine-generated summary.

Naturally, these results are complementary, as is always the case with precision vs. recall. If many words from the system result appear in the human references, you will have a high Bleu score; if many words from the human references appear in the system result, you will have a high Rouge score.

There is also something called the brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results that are shorter than the general length of the reference (read more about it here). This complements the n-gram metric's behavior, which in effect penalizes results longer than the reference, since the denominator grows the longer the system result is.
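A minimal sketch of that penalty as usually defined for BLEU (BP = 1 when the candidate is at least as long as the reference, exp(1 − r/c) otherwise):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU-style brevity penalty: no penalty for candidates at least as
    long as the reference, exponential decay for shorter ones."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

print(brevity_penalty(12, 10))  # 1.0 -- long enough, no penalty
print(brevity_penalty(5, 10))   # ~0.368 -- a half-length candidate is penalized
```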

You could implement something similar for Rouge, but this time penalizing system results that are longer than the general reference length, which would otherwise get artificially higher Rouge scores (the longer the result, the higher the chance of hitting some words that appear in the references). In Rouge we divide by the length of the human references, so we would need an extra penalty for longer system results that could otherwise artificially raise their Rouge score.

Finally, you can use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
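With toy unigram counts, the precision/recall/F1 relationship looks like this (a simplified sketch on token sets, ignoring duplicate tokens and higher-order n-grams):

```python
def unigram_scores(candidate, reference):
    """Crude unigram precision (Bleu-like), recall (Rouge-like), and their
    harmonic mean F1, computed on token sets."""
    cand, ref = set(candidate.split()), set(reference.split())
    overlap = len(cand & ref)
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(unigram_scores("police killed the gunman",
                     "police kill the gunman today"))
# (0.75, 0.6, ~0.667)
```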

Answer 3 (score: 1)

There is also the recent BERTScore metric (arXiv '19, ICLR '20, with nearly 90 citations already), which does not suffer from the well-known problems of ROUGE and BLEU.

From the paper's abstract:

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
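The core mechanism, greedy soft matching of token embeddings, can be sketched with made-up 3-dimensional vectors; in the real metric these are contextual embeddings from a pretrained BERT, and the released implementation also supports idf weighting and score rescaling:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style greedy matching: each candidate token takes its most
    similar reference token (precision), and vice versa (recall)."""
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    return 2 * precision * recall / (precision + recall)

# Toy stand-ins for contextual token embeddings.
cand = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ref = [(0.9, 0.1, 0.0), (0.0, 0.9, 0.2), (0.1, 0.0, 1.0)]
print(bertscore_f1(cand, ref))  # high precision, lower recall (one ref token unmatched)
```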

Answer 4 (score: 0)

There are many parameters you can use to evaluate a summarization system, for example:

Precision = number of important sentences in the summary / total number of sentences in the summary.
Recall = number of important sentences retrieved / total number of important sentences present.

F-score = 2 * (Precision * Recall) / (Precision + Recall)
Compression rate = total number of words in the summary / total number of words in the original document.
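Plugging hypothetical numbers into these formulas (a 10-sentence summary of a document with 20 important sentences, 8 of which the summary retrieves; all counts are made up for illustration):

```python
retrieved_important = 8   # important sentences that made it into the summary
summary_sentences = 10    # total sentences in the summary
total_important = 20      # important sentences in the source document

precision = retrieved_important / summary_sentences        # 0.8
recall = retrieved_important / total_important             # 0.4
f_score = 2 * (precision * recall) / (precision + recall)  # ~0.533

summary_words, original_words = 180, 2000                  # hypothetical word counts
compression_rate = summary_words / original_words          # 0.09
print(precision, recall, f_score, compression_rate)
```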

Answer 5 (score: 0)

When evaluating an automatic summarization system, you would usually look at the content of the summary rather than the time.

Your idea of

    (time taken to read the original document / time taken to read the summary)

does not tell you much about your summarization system; it really only gives you an idea of the system's compression rate (i.e., the summary is 10% of the original document).

You might instead consider the time the system takes to summarize a document compared to the time a human would take (system: 2 s, human: 10 min).

Answer 6 (score: 0)

BLEU

  • Bleu measures precision
  • Bilingual Evaluation Understudy
  • Originally created for machine translation (bilingual)
  • Counts the words of the machine-generated summary that appear in the human reference summary
  • That is, how many words (and/or n-grams) of the machine-generated summary appear in the human reference summary
  • The closer a machine translation is to a professional human translation, the better

ROUGE

  • Rouge measures recall

  • Recall-Oriented Understudy for Gisting Evaluation: counts the words of the human reference summary that appear in the machine-generated summary

  • That is, how many words (and/or n-grams) of the human reference summary appear in the machine-generated summary

  • N-gram overlap between the system and reference summaries; in ROUGE-N, N is the n-gram length

    reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. 
Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
    

Abstractive summarization

   # Abstractive summarization
   len(reference_text.split())  # word count of the source text
   from transformers import pipeline
   summarization = pipeline("summarization")
   abstractve_summarization = summarization(reference_text)[0]["summary_text"]

Abstractive output

   In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)

Extractive summarization

   # Extractive summarize
   from sumy.parsers.plaintext import PlaintextParser
   from sumy.nlp.tokenizers import Tokenizer
   from sumy.summarizers.lex_rank import LexRankSummarizer
   parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
   # parser.document.sentences
   summarizer = LexRankSummarizer()
   extractve_summarization  = summarizer(parser.document,2)
   extractve_summarization = ' '.join([str(s) for s in list(extractve_summarization)])

Extractive output

Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.

Evaluate the abstractive summary using Rouge

  from rouge import Rouge
  r = Rouge()
  r.get_scores(abstractve_summarization, reference_text)

Rouge output for the abstractive summary

  [{'rouge-1': {'f': 0.22299651364421083,
  'p': 0.9696969696969697,
  'r': 0.12598425196850394},
  'rouge-2': {'f': 0.21328671127225052,
  'p': 0.9384615384615385,
  'r': 0.1203155818540434},
  'rouge-l': {'f': 0.29041095634452996,
  'p': 0.9636363636363636,
  'r': 0.17096774193548386}}]

Evaluate the extractive summary using Rouge

  from rouge import Rouge
  r = Rouge()
  r.get_scores(extractve_summarization, reference_text)

Rouge output for the extractive summary

  [{'rouge-1': {'f': 0.27860696251962963,
  'p': 0.8842105263157894,
  'r': 0.16535433070866143},
  'rouge-2': {'f': 0.22296172781038814,
  'p': 0.7127659574468085,
  'r': 0.13214990138067062},
  'rouge-l': {'f': 0.354755780824869,
  'p': 0.8734177215189873,
  'r': 0.22258064516129034}}]

Interpreting ROUGE scores

ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:

    ROUGE-N = ( Σ_r Σ_s Count_match(gram_n) ) / ( Σ_r Σ_s Count(gram_n) )

I have tried to simplify the notation compared to the original paper. Say we are calculating ROUGE-2, i.e. bigram matches. The numerator's ∑s iterates over all bigrams of a single reference summary and counts the number of times a matching bigram is found in the candidate summary (proposed by the summarization algorithm). If there is more than one reference summary, ∑r ensures we repeat the process over all reference summaries.

The denominator simply counts the total number of bigrams across all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents and average all the scores, which gives you the ROUGE-N score. A higher score therefore means that, on average, there is high n-gram overlap between your summaries and the references.
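Following those two sums, ROUGE-N can be sketched for a set of reference summaries (a simplified version without ROUGE's stemming or stopword options):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """Numerator: clipped n-gram matches, summed over all references.
    Denominator: total n-gram count of all references."""
    cand = ngrams(candidate.split(), n)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref.split(), n)
        matched += sum(min(count, cand[g]) for g, count in ref_grams.items())
        total += sum(ref_grams.values())
    return matched / total if total else 0.0

print(rouge_n("police kill the gunman", ["police killed the gunman"], n=2))
# 0.3333... -- one of the three reference bigrams ('the gunman') matches
```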

   Example:

   S1. police killed the gunman
   
   S2. police kill the gunman
   
   S3. the gunman kill police

S1 is the reference; S2 and S3 are candidates. Note that S2 and S3 each have one bigram overlapping with the reference, so they get the same ROUGE-2 score even though S2 should be better. An additional ROUGE-L score deals with this, where L stands for longest common subsequence. In S2, the first word and the last two words match the reference, so it scores 3/4, while S3 only matches the bigram, so it scores 2/4. See the paper for details.
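The 3/4 vs. 2/4 comparison can be checked with a standard longest-common-subsequence dynamic program:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

reference = "police killed the gunman".split()
for cand in ("police kill the gunman", "the gunman kill police"):
    tokens = cand.split()
    print(cand, "->", lcs_len(reference, tokens), "/", len(tokens))
# police kill the gunman -> 3 / 4
# the gunman kill police -> 2 / 4
```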