Why does dependency parsing mark a WDT word as the sentence subject?

Time: 2019-02-28 21:22:46

Tags: python nlp spacy

I want to report the subject of each sentence, and extract all of its modifiers. (For example, "Donald Trump", not just "Trump"; "average remaining lease term", not just "term".)

Here is my test code:

import spacy

nlp = spacy.load('en_core_web_sm')

def handle(doc):
    for sent in doc.sents:
        shownSentence = False
        for token in sent:
            if token.dep_ == "nsubj":
                if not shownSentence:
                    print("----------")
                    print(sent)
                    shownSentence = True
                print("{0}/{1}".format(token.text, token.tag_))
                print([[t, t.tag_] for t in token.children])

handle(nlp('Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink." The average remaining lease term is six years, and Laura Palmer was killed by Bob. Trump added he will sell up soon.'))

The output is below. I want to know why "which/WDT" is being reported as a subject. Is it just model noise, or is it considered correct behaviour? (Incidentally, in my real sentences, which have the same structure, I also get WDT tokens marked as subjects.) (Update: if I switch to "en_core_web_md", I get "that/WDT" in my Trump example; that is the only difference in moving from the small model to the medium one.)

I can easily filter these out by looking at tag_; I am more interested in the root cause.

Update: by the way, this code misses "Laura Palmer" as a subject, because its dep_ value is "nsubjpass", not "nsubj".

----------
Donald Trump, legend in his own lifetime, said: "This transaction is a continuation of our main strategy to invest in assets which offer growth potential and that are coloured pink."
Trump/NNP
[[Donald, 'NNP'], [,, ','], [legend, 'NN'], [,, ',']]
transaction/NN
[[This, 'DT']]
which/WDT
[]
----------
The average remaining lease term is six years, and Laura Palmer was killed by Bob.
term/NN
[[The, 'DT'], [average, 'JJ'], [remaining, 'JJ'], [lease, 'NN']]
----------
Trump added he will sell up soon.
Trump/NNP
[]
he/PRP
[]
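Putting the two observations above together (filtering WDT subjects by tag_, and also catching nsubjpass subjects), a minimal sketch of such a filter might look like this. To keep it self-contained it works on plain (text, tag, dep) tuples standing in for spaCy's token.text / token.tag_ / token.dep_ attributes, so no model needs to be loaded; the function name is hypothetical.

```python
# Hypothetical helper: collect "real" subjects, covering both active
# (nsubj) and passive (nsubjpass) subjects, while skipping relative
# pronouns such as which/WDT or who/WP.
def real_subjects(tokens):
    """tokens: iterable of (text, tag, dep) tuples mimicking spaCy's
    token.text, token.tag_ and token.dep_ attributes."""
    return [text
            for text, tag, dep in tokens
            if dep in ("nsubj", "nsubjpass") and tag not in ("WDT", "WP")]

# Mock tokens based on the parse of the second test sentence:
sent = [("term", "NN", "nsubj"),
        ("which", "WDT", "nsubj"),
        ("Laura", "NNP", "compound"),
        ("Palmer", "NNP", "nsubjpass")]
print(real_subjects(sent))  # → ['term', 'Palmer']
```

With real spaCy tokens the same predicate would be applied to token.dep_ and token.tag_ inside the loop over sent.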

(By the way, the bigger picture here is pronoun resolution: I want to convert PRP tokens into the text they refer to.)

0 Answers:

There are no answers.