Question

I am new to spaCy and I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:

import spacy

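# the "en" shortcut was removed in spaCy v3; load the model by its full name
nlp = spacy.load("en_core_web_sm")

# a with-block closes the file once it has been read
with open("E:/test.txt", "r") as file:
    doc = nlp(file.read())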
for np in doc.noun_chunks:
    print(np.text)

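But it returns only the base noun phrases, that is, phrases that do not contain any other NP. For the following phrase, I get the result below:

Phrase: We try to explicitly describe the geometry of the edges of the images.

Result: We, the geometry, the edges, the images.

Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.

How can I get all noun phrases, including nested phrases?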

Answer 1

Please see the commented code below, which recursively combines the nouns. The code is inspired by the spaCy docs.

import spacy

nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in spaCy v3

doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

for np in doc.noun_chunks: # use np instead of np.text
    print(np)

print()

# code to recursively combine nouns
# 'We' is actually a pronoun but included in your question
# hence the token.pos_ == "PRON" part in the last if statement
# suggest you extract PRON separately like the noun-chunks above

# collect the document index of every token tagged as NOUN
nounIndices = [token.i for token in doc if token.pos_ == 'NOUN']


print(nounIndices)
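for idxValue in nounIndices:
    # re-parse each time so every merge starts from the unmerged tokens
    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
    # span covering the noun's whole subtree, from leftmost to rightmost descendant
    span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i+1]
    # Span.merge() was removed in spaCy v3; the retokenizer replaces it
    with doc.retokenize() as retokenizer:
        retokenizer.merge(span)

    for token in doc:
        if token.dep_ == 'dobj' or token.dep_ == 'pobj' or token.pos_ == "PRON":
            print(token.text)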
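
To make clearer what the left_edge / right_edge span above covers, here is a minimal library-free sketch. It does not use spaCy at all: the `heads` list is a parse I wrote by hand for the example sentence (an assumption, not parser output), and `subtree_edges` and `full_phrases` are hypothetical helper names. For each noun it finds the leftmost and rightmost token of the noun's subtree and joins the words in between.

```python
words = ["We", "try", "to", "explicitly", "describe", "the", "geometry",
         "of", "the", "edges", "of", "the", "images", "."]
# Hand-written parse (an assumption, not spaCy output):
# heads[i] is the index of token i's head; the root points to itself.
heads = [1, 1, 4, 4, 1, 6, 4, 6, 9, 7, 9, 12, 10, 1]
noun_indices = [6, 9, 12]  # geometry, edges, images


def subtree_edges(heads, i):
    """Return the leftmost and rightmost token index in the subtree of token i."""
    members = []
    for j in range(len(heads)):
        k = j
        # Climb from j towards the root; j is in the subtree if the path reaches i.
        while k != i and heads[k] != k:
            k = heads[k]
        if k == i:
            members.append(j)
    return min(members), max(members)


def full_phrases(words, heads, noun_indices):
    """Join the words spanned by each noun's subtree into one phrase."""
    phrases = []
    for i in noun_indices:
        left, right = subtree_edges(heads, i)
        phrases.append(" ".join(words[left:right + 1]))
    return phrases


for phrase in full_phrases(words, heads, noun_indices):
    print(phrase)
```

With this hand-written parse, each noun yields the phrase that includes everything attached beneath it, which is where the nested phrases come from. A real parse may attach the prepositions differently, and the phrases change accordingly.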

Answer 2

Please try this to get all the nouns (as noun chunks) from the text:

import spacy
nlp = spacy.load("en_core_web_sm")
text = ("We try to explicitly describe the geometry of the edges of the images.")
doc = nlp(text)
print([chunk.text for chunk in doc.noun_chunks])

Answer 3

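For every noun chunk you can also get the subtree beneath it. spaCy provides two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns a Token iterator rather than a span. Combining noun_chunks with their subtrees leads to some duplication, which can be removed afterwards.

Here is an example using the left_edge and right_edge attributes: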

{np.text
  for nc in doc.noun_chunks
  for np in [
    nc,
    doc[
      nc.root.left_edge.i
      :nc.root.right_edge.i+1]]}

==>

{'We',
 'the edges',
 'the edges of the images',
 'the geometry',
 'the geometry of the edges of the images',
 'the images'}
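
The subtree attribute can be used in much the same way. The following is only a sketch: `nested_noun_phrases` and `demo` are names I made up, and `demo` assumes that spaCy and the en_core_web_sm model are installed. It keeps the phrases in document order and drops duplicates by span position instead of relying on a set of strings.

```python
def nested_noun_phrases(doc):
    """Collect each noun chunk plus the full subtree span of its root.

    Phrases are returned in document order, and a span that was already
    seen (same start and end) is skipped.
    """
    phrases = []
    seen = set()
    for chunk in doc.noun_chunks:
        # Token.subtree yields the root token and all of its descendants.
        indices = [token.i for token in chunk.root.subtree]
        full = doc[min(indices) : max(indices) + 1]
        for span in (chunk, full):
            key = (span.start, span.end)
            if key not in seen:
                seen.add(key)
                phrases.append(span.text)
    return phrases


def demo():
    # Assumes spaCy and the en_core_web_sm model are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
    for phrase in nested_noun_phrases(doc):
        print(phrase)
```

Calling demo() prints one phrase per line. Which chunks and subtrees come out depends on the model's parse, so the result can differ between models and versions.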

How to get all noun phrases in spaCy
