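pyrouge: tuple index out of range

Time: 2017-09-13 02:39:03

Tags: python nlp summary rouge

I am trying to use pyrouge to compute the similarity between an automated summary and a set of gold-standard summaries. Rouge works fine while it processes the two sets of summaries, but when it writes out the results it complains "tuple index out of range". Does anyone know what causes this and how I can fix it?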

2017-09-13 23:54:57,524 [MainThread  ] [INFO ]  Set ROUGE home directory to D:\ComputerScience\Research\ROUGE-1.5.5\ROUGE-1.5.5.
2017-09-13 23:54:57,524 [MainThread  ] [INFO ]  Writing summaries.
2017-09-13 23:54:57,524 [MainThread  ] [INFO ]  Processing summaries. Saving system files to C:\Users\zhuan\AppData\Local\Temp\tmppm193twp\system and model files to C:\Users\zhuan\AppData\Local\Temp\tmppm193twp\model.
2017-09-13 23:54:57,524 [MainThread  ] [INFO ]  Processing files in D:\ComputerScience\Research\summary\Grendel\automated.
2017-09-13 23:54:57,524 [MainThread  ] [INFO ]  Processing automated.txt.
2017-09-13 23:54:57,539 [MainThread  ] [INFO ]  Saved processed files to C:\Users\zhuan\AppData\Local\Temp\tmppm193twp\system.
2017-09-13 23:54:57,539 [MainThread  ] [INFO ]  Processing files in D:\ComputerScience\Research\summary\Grendel\manual.
2017-09-13 23:54:57,539 [MainThread  ] [INFO ]  Processing BookRags.txt.
2017-09-13 23:54:57,539 [MainThread  ] [INFO ]  Processing GradeSaver.txt.
2017-09-13 23:54:57,539 [MainThread  ] [INFO ]  Processing GradeSummary.txt.
2017-09-13 23:54:57,557 [MainThread  ] [INFO ]  Processing Wikipedia.txt.
2017-09-13 23:54:57,562 [MainThread  ] [INFO ]  Saved processed files to C:\Users\zhuan\AppData\Local\Temp\tmppm193twp\model.
Traceback (most recent call last):

  File "<ipython-input-8-bc227b272111>", line 1, in <module>
    runfile('D:/ComputerScience/Research/automate_summary.py', wdir='D:/ComputerScience/Research')

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 707, in runfile
    execfile(filename, namespace)

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/ComputerScience/Research/automate_summary.py", line 53, in <module>
    output = r.convert_and_evaluate()

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\pyrouge\Rouge155.py", line 361, in convert_and_evaluate
    rouge_output = self.evaluate(system_id, rouge_args)

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\pyrouge\Rouge155.py", line 331, in evaluate
    self.write_config(system_id=system_id)

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\pyrouge\Rouge155.py", line 315, in write_config
    self._config_file, system_id)

  File "C:\Users\zhuan\Anaconda3\lib\site-packages\pyrouge\Rouge155.py", line 264, in write_config_static
    system_filename_pattern = re.compile(system_filename_pattern)

  File "C:\Users\zhuan\Anaconda3\lib\re.py", line 233, in compile
    return _compile(pattern, flags)

  File "C:\Users\zhuan\Anaconda3\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)

  File "C:\Users\zhuan\Anaconda3\lib\sre_compile.py", line 562, in compile
    p = sre_parse.parse(p, flags)

  File "C:\Users\zhuan\Anaconda3\lib\sre_parse.py", line 855, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "C:\Users\zhuan\Anaconda3\lib\sre_parse.py", line 416, in _parse_sub
    not nested and not items))

  File "C:\Users\zhuan\Anaconda3\lib\sre_parse.py", line 616, in _parse
    source.tell() - here + len(this))

error: nothing to repeat

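The gold standards are BookRags.txt, GradeSaver.txt, GradeSummary.txt and Wikipedia.txt. The summary to be compared is automated.txt.
Shouldn't *.txt or [a-z0-9A-Z]+ work? The former gives me a "nothing to repeat" error, and the latter a "tuple index out of range" error.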

from pyrouge import Rouge155

# Raw strings keep the Windows backslashes from being read as escape sequences.
r = Rouge155(r"D:\ComputerScience\Research\ROUGE-1.5.5\ROUGE-1.5.5")
r.system_dir = r'D:\ComputerScience\Research\summary\Grendel\automated'
r.model_dir = r'D:\ComputerScience\Research\summary\Grendel\manual'
r.system_filename_pattern = '[a-z0-9A-Z]+.txt'
r.model_filename_pattern = '[a-z0-9A-Z]+.txt'
output = r.convert_and_evaluate()
print(output)

I set the two directories manually. The Rouge package seems able to process the txt files in them.

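2 Answers:

Answer 0: (score: 2)

I ran into the same problem with the pyrouge package. It happens because the source code tries to match the file names we supply against a pattern, and ends up with an empty tuple when that fails. If you want more detail, look at the Rouge155.py file, in particular the function __get_model_filenames_for_id().

I followed the exact file-naming instructions given on the official page, and that resolved it: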

r.system_filename_pattern = 'some_name.(\d+).txt'

r.model_filename_pattern = 'some_name.[A-Z].#ID#.txt'

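So, my suggestions are:

  • Create two separate directories, one for system_summaries (system generated) and one for model_summaries (human generated / gold standard)
  • Provide the exact paths to these directories
  • If you compare one system_summary (e.g. SystemSummary.1.txt) against a set of model_summaries (e.g. ModelSummary.A.1.txt, ModelSummary.B.1.txt, ModelSummary.C.1.txt), provide the following patterns:
      r.system_filename_pattern = 'SystemSummary.(\d+).txt'

      r.model_filename_pattern = 'ModelSummary.[A-Z].#ID#.txt'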

You can extend this scheme according to the number of summaries you want to evaluate.
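To make the naming scheme concrete, here is a minimal sketch of how a captured ID can tie one system summary to its model summaries. It uses only the standard re module and is not pyrouge's actual code; pair_summaries is a hypothetical helper, and the file lists are made up for illustration.

```python
import re

def pair_summaries(system_files, model_files, system_pattern, model_pattern):
    """Hypothetical helper: map each system file to its model files by ID."""
    system_regex = re.compile(system_pattern)
    pairs = {}
    for system_file in system_files:
        match = system_regex.match(system_file)
        if match is None or not match.groups():
            # No match, or no capturing group to take the ID from: skip the file.
            continue
        summary_id = match.group(1)
        # Substitute the captured ID for the #ID# placeholder in the model pattern.
        model_regex = re.compile(model_pattern.replace('#ID#', summary_id))
        pairs[system_file] = sorted(f for f in model_files if model_regex.match(f))
    return pairs

# Illustrative file names following the convention described above.
system_files = ['SystemSummary.1.txt', 'SystemSummary.2.txt']
model_files = ['ModelSummary.A.1.txt', 'ModelSummary.B.1.txt',
               'ModelSummary.C.1.txt', 'ModelSummary.A.2.txt']

pairs = pair_summaries(system_files, model_files,
                       r'SystemSummary.(\d+).txt',
                       r'ModelSummary.[A-Z].#ID#.txt')
for system_file, models in pairs.items():
    print(system_file, '->', models)

# The pattern from the question has no capturing group, so nothing can be paired.
unpaired = pair_summaries(['automated.txt'], ['Wikipedia.txt'],
                          r'[a-z0-9A-Z]+.txt', r'[a-z0-9A-Z]+.txt')
```

The point of the sketch is that the (\d+) group in the system pattern is what supplies the ID; without it there is nothing to substitute into the model pattern.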

Hope this helps! Good luck!

Answer 1: (score: 1)

The problem is that the rouge library never accounted for the case where the regular expression yields no captured group. The line id = match.groups(0)[0] in the rouge source code is the culprit. According to the documentation, the groups function returns a tuple containing all the subgroups of the match. When the pattern has no capturing group, as with [a-z0-9A-Z]+.txt, that tuple is empty, and the code tries to take the first item from an empty tuple, which raises the error.
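Both errors from the question can be reproduced with the standard re module alone, without pyrouge or ROUGE installed. The following is a small sketch under that assumption; the variable names are illustrative only.

```python
import re

# Pattern from the question: it matches the file name but defines no capturing group.
pattern_without_group = re.compile(r'[a-z0-9A-Z]+.txt')
match = pattern_without_group.match('automated.txt')
print(match.groups(0))  # the tuple of captured groups, empty here

try:
    file_id = match.groups(0)[0]
except IndexError as exc:
    index_error_message = str(exc)
    print('IndexError:', index_error_message)

# Adding a capturing group gives groups() something to return.
pattern_with_group = re.compile(r'SystemSummary.(\d+).txt')
captured_id = pattern_with_group.match('SystemSummary.1.txt').groups(0)[0]
print('captured ID:', captured_id)

# '*.txt' is a shell-style glob, not a regex: the leading '*' has nothing to repeat.
try:
    re.compile('*.txt')
except re.error as exc:
    glob_error_message = str(exc)
    print('re.error:', glob_error_message)
```

So the two patterns fail for different reasons: *.txt is rejected when it is compiled, while [a-z0-9A-Z]+.txt compiles and matches but captures nothing to use as an ID.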