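How do I extract noun phrases from text using spaCy?
I am not referring to part-of-speech tags.
I can't find anything in the documentation about noun phrases or regular parse trees.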
Answer 0 (score: 41)
If you want base NPs, i.e. NPs without coordination, prepositional phrases or relative clauses, you can use the noun_chunks iterator on Doc and Span objects:
>>> from spacy.en import English  # spaCy 1.x import path
>>> nlp = English()
>>> doc = nlp(u'The cat and the dog sleep in the basket near the door.')
>>> for np in doc.noun_chunks:
...     np.text
...
u'The cat'
u'the dog'
u'the basket'
u'the door'
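If you need something else, the best way is to iterate over the words of the sentence and use the syntactic context to decide whether a word governs the type of phrase you want. If it does, yield its subtree:
from spacy.symbols import nsubj, nsubjpass, dobj, iobj, pobj

np_labels = {nsubj, nsubjpass, dobj, iobj, pobj}  # Probably others too

def iter_nps(doc):
    # Yield the subtree of every word whose dependency label marks an NP head
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree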
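To see what "yield its subtree" produces without running a parser, here is a minimal sketch on a hand-annotated sentence. The Tok tuples, the dependency labels, and the subtree helper are hypothetical stand-ins for spaCy tokens and Token.subtree; the annotations are assumptions written by hand, not parser output.

```python
from typing import Iterator, List, NamedTuple


class Tok(NamedTuple):
    text: str
    head: int  # index of the governing token (the root points at itself)
    dep: str   # dependency label, hand-assigned for this illustration


# "The cat sleeps in the basket near the door." (hand-annotated, an assumption)
SENT: List[Tok] = [
    Tok("The", 1, "det"),
    Tok("cat", 2, "nsubj"),
    Tok("sleeps", 2, "ROOT"),
    Tok("in", 2, "prep"),
    Tok("the", 5, "det"),
    Tok("basket", 3, "pobj"),
    Tok("near", 5, "prep"),
    Tok("the", 8, "det"),
    Tok("door", 6, "pobj"),
    Tok(".", 2, "punct"),
]

NP_LABELS = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj"}


def subtree(sent: List[Tok], i: int) -> List[int]:
    """Return the indices of token i and all its descendants, in sentence order."""
    members = {i}
    changed = True
    while changed:  # keep adding tokens whose head is already in the set
        changed = False
        for j, tok in enumerate(sent):
            if j not in members and tok.head in members:
                members.add(j)
                changed = True
    return sorted(members)


def iter_nps(sent: List[Tok]) -> Iterator[str]:
    """Yield the text of the subtree under every token with an NP-like label."""
    for i, tok in enumerate(sent):
        if tok.dep in NP_LABELS:
            yield " ".join(sent[j].text for j in subtree(sent, i))


phrases = list(iter_nps(SENT))
for phrase in phrases:
    print(phrase)
```

Unlike a base NP, the phrase headed by "basket" keeps its attached prepositional phrase, which is the reason to walk subtrees when noun_chunks is too narrow.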
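Answer 1 (score: 1)
If you want to specify more precisely which kind of noun phrase to extract, you can use textacy's matches function. You can pass any combination of POS tags. For example,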
textacy.extract.matches(doc, "POS:ADP POS:DET:? POS:ADJ:? POS:NOUN:+")
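will return all nouns that are preceded by a preposition and, optionally, a determiner and/or an adjective.
textacy is built on top of spaCy, so the two should work together seamlessly.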
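To make the pattern concrete, the sketch below applies the same shape (one preposition, an optional determiner, an optional adjective, then one or more nouns) to a hand-tagged sentence in plain Python. It is not textacy's implementation; match_end, find_matches and the tagged sentence are hypothetical names and data introduced only for illustration.

```python
from typing import List, Optional, Tuple

# Hand-tagged (word, coarse POS) pairs; an assumption, not tagger output.
TAGGED: List[Tuple[str, str]] = [
    ("I", "PRON"), ("slept", "VERB"), ("in", "ADP"), ("the", "DET"),
    ("old", "ADJ"), ("barn", "NOUN"), ("near", "ADP"), ("town", "NOUN"),
    (".", "PUNCT"),
]


def match_end(tags: List[str], start: int) -> Optional[int]:
    """Return the end index of an ADP DET? ADJ? NOUN+ match at start, else None."""
    n = len(tags)
    if start >= n or tags[start] != "ADP":
        return None
    i = start + 1
    if i < n and tags[i] == "DET":  # optional determiner
        i += 1
    if i < n and tags[i] == "ADJ":  # optional adjective
        i += 1
    first_noun = i
    while i < n and tags[i] == "NOUN":  # one or more nouns
        i += 1
    return i if i > first_noun else None


def find_matches(tagged: List[Tuple[str, str]]) -> List[str]:
    """Scan left to right and collect non-overlapping matches as text."""
    tags = [pos for _, pos in tagged]
    found = []
    i = 0
    while i < len(tags):
        end = match_end(tags, i)
        if end is None:
            i += 1
        else:
            found.append(" ".join(word for word, _ in tagged[i:end]))
            i = end
    return found


matches = find_matches(TAGGED)
print(matches)
```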
Answer 2 (score: 0)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('Bananas are an excellent source of potassium.')
for np in doc.noun_chunks:
    print(np.text)
'''
Bananas
an excellent source
potassium
'''
for word in doc:
    print('word.dep:', word.dep, ' | ', 'word.dep_:', word.dep_)
'''
word.dep: 429 | word.dep_: nsubj
word.dep: 8206900633647566924 | word.dep_: ROOT
word.dep: 415 | word.dep_: det
word.dep: 402 | word.dep_: amod
word.dep: 404 | word.dep_: attr
word.dep: 443 | word.dep_: prep
word.dep: 439 | word.dep_: pobj
word.dep: 445 | word.dep_: punct
'''
from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj])
print('np_labels:', np_labels)
'''
np_labels: {416, 422, 429, 430, 439}
'''
# On yield vs. return, see:
# https://www.geeksforgeeks.org/use-yield-keyword-instead-return-keyword-python/
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.dep_
iter_nps(doc)
'''
<generator object iter_nps at 0x7fd7b08b5bd0>
'''
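The generator object shown above is expected: calling a function that contains yield only creates the generator, and its body runs once something iterates over it. A minimal sketch, using hand-written (text, label) pairs in place of parsed tokens (the labels mirror the word.dep_ output above; iter_np_heads is a hypothetical name):

```python
NP_DEPS = {"nsubj", "nsubjpass", "dobj", "iobj", "pobj"}

# Hand-written (text, dependency label) pairs standing in for spaCy tokens
TOKENS = [
    ("Bananas", "nsubj"), ("are", "ROOT"), ("an", "det"),
    ("excellent", "amod"), ("source", "attr"), ("of", "prep"),
    ("potassium", "pobj"), (".", "punct"),
]


def iter_np_heads(tokens):
    """Yield (text, label) for every token whose label marks an NP head."""
    for text, dep in tokens:
        if dep in NP_DEPS:
            yield text, dep


gen = iter_np_heads(TOKENS)  # nothing has been computed yet
heads = list(gen)            # iterating runs the body and collects the results
for text, dep in heads:
    print(text, dep)

leftover = list(gen)         # a generator can only be consumed once
print(leftover)
```

So the yield-based version can stay as it is, provided the caller wraps it in list() or loops over it.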
## Modified method:
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            print(word.text, word.dep_)
iter_nps(doc)
'''
Bananas nsubj
potassium pobj
'''
doc = nlp('BRCA1 is a tumor suppressor protein that functions to maintain genomic stability.')
for np in doc.noun_chunks:
    print(np.text)
'''
BRCA1
a tumor suppressor protein
genomic stability
'''
iter_nps(doc)
'''
BRCA1 nsubj
that nsubj
stability dobj
'''
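Answer 3 (score: 0)
from spacy.en import English
may give you the error
No module named 'spacy.en'
All language data was moved to the submodule spacy.lang in spaCy 2.0+.
Use from spacy.lang.en import English instead,
then follow the remaining steps from @syllogism_'s answer.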
Answer 4 (score: 0)
You can also get the nouns from a sentence like this:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at "
          "Google in 2007, few people outside of the company took him "
          "seriously. “I can tell you very senior CEOs of major American "
          "car companies would shake my hand and turn away because I wasn’t "
          "worth talking to,” said Thrun, in an interview with Recode earlier "
          "this week.")
# doc text is from the spaCy website
for x in doc:
    # Keep nouns, proper nouns and pronouns
    if x.pos_ in ("NOUN", "PROPN", "PRON"):
        print(x)