Question

I am using spaCy to parse documents, but unfortunately I cannot get noun chunks to be processed the way I expect. Here is my code:

# Import spacy
import spacy
nlp = spacy.load("en_core_web_lg")

# Add noun chunking to the pipeline
merge_noun_chunks = nlp.create_pipe("merge_noun_chunks")
nlp.add_pipe(merge_noun_chunks)

# Process the document
docs = nlp.pipe(["The big dogs chased the fast cat"])

# Print out the tokens
for doc in docs:
    for token in doc:
        print("text: {}, lemma: {}, pos: {}, tag: {}, dep: {}".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_))

The output I get is:

text: The big dogs, lemma: the, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the, pos: NOUN, tag: NN, dep: dobj

The problem is in the first line of the output, where "The big dogs" is parsed in an unexpected way: it is given a lemma of "the", a pos of "NOUN", a tag of "NNS", and a dep of "nsubj".

The output I was hoping for is:

text: The big dogs, lemma: the big dog, pos: NOUN, tag: NNS, dep: nsubj
text: chased, lemma: chase, pos: VERB, tag: VBD, dep: ROOT
text: the fast cat, lemma: the fast cat, pos: NOUN, tag: NN, dep: dobj

I expected the lemma to be the phrase "the big dog", with the plural changed to singular, and the phrase to have a pos of "NOUN", a tag of "NNS", and a dep of "nsubj".

Is this the correct behavior, or am I using spaCy incorrectly? If I am using it incorrectly, please tell me the right way to do this.

Answer 1

There are a few things to consider here:

Lemmatization is token-based
POS tagging and dependency parsing are predictive

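You will probably get "the big dog" if you take the lemma_ attribute of each token in the phrase. Using that attribute will not update the token's pos.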

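Also, because dependency parsing and POS tagging come from a trained predictive model, they are not guaranteed to always be "right" from a human-language point of view.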

Other than the lemma issue, you seem to be using spaCy correctly.
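For the lemma issue, here is a minimal sketch of the per-token approach described above: join the lemma_ of every token in each noun chunk. The helper names phrase_lemma and noun_chunk_lemmas are hypothetical, not part of spaCy. The demo uses stand-in objects with hard-coded lemmas so it runs without a model. With spaCy, you would presumably pass a Doc from a pipeline that does not include merge_noun_chunks, so that each chunk still contains its individual tokens.

```python
from types import SimpleNamespace


def phrase_lemma(tokens):
    """Join the per-token lemmas of a phrase into one string."""
    return " ".join(token.lemma_ for token in tokens)


def noun_chunk_lemmas(doc):
    """Return (chunk text, joined lemma) pairs for every noun chunk.

    `doc` only needs a `noun_chunks` iterable, as a spaCy Doc provides.
    Each chunk must have `.text` and yield tokens that carry `.lemma_`.
    """
    return [(chunk.text, phrase_lemma(chunk)) for chunk in doc.noun_chunks]


# --- Stand-ins for illustration only (assumed lemmas, not model output) ---
class FakeChunk(list):
    """A list of tokens that also exposes `.text`, like a spaCy Span."""

    @property
    def text(self):
        return " ".join(token.text for token in self)


def tok(text, lemma):
    """Build a stand-in token with the two attributes used above."""
    return SimpleNamespace(text=text, lemma_=lemma)


fake_doc = SimpleNamespace(
    noun_chunks=[
        FakeChunk([tok("The", "the"), tok("big", "big"), tok("dogs", "dog")]),
        FakeChunk([tok("the", "the"), tok("fast", "fast"), tok("cat", "cat")]),
    ]
)

result = noun_chunk_lemmas(fake_doc)
for text, lemma in result:
    print("text: {}, lemma: {}".format(text, lemma))
```

The pos, tag and dep values you expected already match what the merged token reports, so only the lemma needs to be computed separately like this.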
