How do I reformat the output of Malt Parser in NLTK?

Asked: 2014-09-19 20:26:35

Tags: python parsing nlp nltk

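So I finally figured out how to use the Malt wrapper provided in NLTK (see "How to use malt parser in python nltk"), and I can now parse my sentences successfully, but the output comes back in a format I am not familiar with.

For example, parsing "This is a test sentence" returns: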

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(This (sentence is a test))

Parsing a more complex sentence:

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "A ceasefire for east Ukraine has been agreed during talks in Minsk."
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(agreed
   (ceasefire A (for (Ukraine east)))
   has
   been
   (during (talks (in Minsk)))
   .)

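Can someone explain what this output format is, or how I can process it so that it follows the word order of the original sentence, like this: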

(This (is a test sentence))
A (ceasefire (for (east Ukraine))) has been (agreed (during (talks (in Minsk))).)

If it helps, graph is an nltk DependencyGraph and graph.tree() is an nltk Tree.

Thanks in advance.

1 Answer:

Answer 0 (score: 1)

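MaltParser is a system for data-driven dependency parsing. It can be used to induce a parsing model from treebank data and to parse new data with the induced model.

The files engmalt.poly-1.7.mco and engmalt.linear-1.7.mco each contain a single Malt configuration for parsing English text with MaltParser.

The two models differ in that engmalt.poly-1.7.mco uses an SVM with a polynomial kernel for classification, while engmalt.linear-1.7.mco uses a linear SVM. The latter parser is much faster, but the former needs less memory, and the two models have similar parsing accuracy. They also differ in how the parsed output is written.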

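With engmalt.poly-1.7.mco the parsed output is represented as a dependency annotation (a dependency graph), whereas with engmalt.linear-1.7.mco it is represented in a linear fashion.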
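To make the bracketed notation easier to read, here is a minimal sketch in plain Python (no NLTK needed). It assumes the convention visible in the question's output: the first token after each opening parenthesis is a head word, and everything else inside that pair of parentheses depends on it. The helper names `parse_brackets` and `head_dependent_pairs` are made up for this illustration and are not part of NLTK.

```python
def parse_brackets(text):
    """Turn '(head dep dep ...)' notation into nested (word, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            # A bare word is a leaf: it has no dependents of its own.
            word = tokens[pos]
            pos += 1
            return (word, [])
        pos += 1                      # skip "("
        head = tokens[pos]            # first token in the bracket is the head
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(node())
        pos += 1                      # skip ")"
        return (head, children)

    return node()


def head_dependent_pairs(tree):
    """List (head, dependent) pairs in the order they appear in the tree."""
    head, children = tree
    pairs = []
    for child in children:
        pairs.append((head, child[0]))
        pairs.extend(head_dependent_pairs(child))
    return pairs


# The tree printed in the question for the ceasefire sentence, on one line.
tree = parse_brackets(
    "(agreed (ceasefire A (for (Ukraine east))) has been "
    "(during (talks (in Minsk))) .)"
)
pairs = head_dependent_pairs(tree)
for head, dep in pairs:
    print(f"{head} -> {dep}")
```

Read this way, "agreed" is the root of the sentence, "ceasefire", "has", "been", "during" and "." hang directly off it, and "east" depends on "Ukraine". The brackets group a word with its dependents, not with its neighbours in the sentence, which is why the word order looks scrambled.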

See the outputs below. Hope this helps.

Using mco="engmalt.linear-1.7":

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(This (sentence is a test))

Using mco="engmalt.poly-1.7":

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.poly-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(is This (a (sentence test)))

For the more complex sentence, using mco="engmalt.linear-1.7":

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "A ceasefire for east Ukraine has been agreed during talks in Minsk."
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(A
  (agreed
    (been ceasefire for east Ukraine has)
    (during (Minsk talks in)))
  .)
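
On the second half of the question (getting back the original word order): the printed tree no longer carries word positions, but a dependency graph does. The sketch below is plain Python and uses a hand-written table of (address, word, head address) rows. The head values are read off the tree shown in the question for the ceasefire sentence, and the names `ROWS`, `children_of`, `head_first` and `in_order` are hypothetical. As far as I know, NLTK's DependencyGraph keeps the same three fields on each node (for example through `graph.nodes` in NLTK 3), but treat that as an assumption and check it against your installed version.

```python
# (address, word, head address); head 0 marks the root of the sentence.
ROWS = [
    (1, "A", 2), (2, "ceasefire", 8), (3, "for", 2), (4, "east", 5),
    (5, "Ukraine", 3), (6, "has", 8), (7, "been", 8), (8, "agreed", 0),
    (9, "during", 8), (10, "talks", 9), (11, "in", 10), (12, "Minsk", 11),
    (13, ".", 8),
]


def children_of(rows):
    """Map each head address to the addresses of its dependents."""
    kids = {}
    for address, _word, head in rows:
        kids.setdefault(head, []).append(address)
    return kids


def head_first(address, words, kids):
    """Bracketing with the head written first, as in the printed trees."""
    deps = kids.get(address, [])
    if not deps:
        return words[address]
    inner = " ".join(head_first(d, words, kids) for d in deps)
    return f"({words[address]} {inner})"


def in_order(address, words, kids):
    """Bracketing with every head kept at its original position.

    Sorting by address restores the sentence order only for projective
    trees, where the subtrees of one head do not interleave.
    """
    deps = kids.get(address, [])
    if not deps:
        return words[address]
    parts = [(address, words[address])]
    parts += [(d, in_order(d, words, kids)) for d in deps]
    parts.sort()  # tuples sort by address first
    return "(" + " ".join(text for _addr, text in parts) + ")"


words = {address: word for address, word, _head in ROWS}
kids = children_of(ROWS)
root = kids[0][0]

sentence = " ".join(word for _addr, word, _head in sorted(ROWS))
print(sentence)
print(head_first(root, words, kids))
print(in_order(root, words, kids))
```

The in-order bracketing keeps the words in sentence order while still grouping each head with its dependents. It is not identical to the bracketing requested in the question, which looks more like a phrase-structure tree than a dependency tree, so some further regrouping would be needed to reproduce that exactly.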