I have been extracting noun chunks with spaCy, using the `Doc.noun_chunks` attribute it provides. How can I use the spaCy library to extract verb phrases (of the form "VERB ? ADV * VERB +") from input text?
Answer 0: (score: 9)
This might help you.
from __future__ import unicode_literals
import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
# POS regex: an optional verb, any number of adverbs, then one or more verbs
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:
    print(match.text)
Output:
is writing
For how to highlight the verb phrases, see the following link.
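The `<VERB>?<ADV>*<VERB>+` pattern is just a regular expression over POS tags. To illustrate the idea independently of textacy, here is a minimal sketch using Python's `re` module, with hand-labeled tags standing in for the tagger's output:

```python
import re

# Hand-labeled (token, POS) pairs standing in for a tagger's output
tokens = [("The", "DET"), ("author", "NOUN"), ("is", "VERB"),
          ("writing", "VERB"), ("a", "DET"), ("new", "ADJ"), ("book", "NOUN")]

# Encode each POS tag as one character so token index == string index
code = {"VERB": "V", "ADV": "A"}
tag_str = "".join(code.get(tag, "x") for _, tag in tokens)  # "xxVVxxx"

# <VERB>?<ADV>*<VERB>+ becomes V?A*V+ over the encoded tag string
phrases = [" ".join(tok for tok, _ in tokens[m.start():m.end()])
           for m in re.finditer(r"V?A*V+", tag_str)]
print(phrases)  # ['is writing']
```

This is only a toy re-implementation of the matching idea; textacy handles tokenization, tagging, and index bookkeeping for you.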
Answer 1: (score: 3)
The answer above refers to textacy, but the same can be done directly with spaCy's Matcher, without a wrapper library.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm') # download model first
sentence = 'The author was staring pensively as she wrote'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'OP': '*'},  # additional wildcard - match any text in between
           {'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance and register the pattern
matcher = Matcher(nlp.vocab)
matcher.add('verb-phrase', None, pattern)  # in spaCy 3.x: matcher.add('verb-phrase', [pattern])
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
This returns a list of tuples containing the match ID and the start and end indices of each match, e.g.:
[(15658055046270554203, 0, 4),
(15658055046270554203, 1, 4),
(15658055046270554203, 2, 4),
(15658055046270554203, 3, 4),
(15658055046270554203, 0, 8),
(15658055046270554203, 1, 8),
(15658055046270554203, 2, 8),
(15658055046270554203, 3, 8),
(15658055046270554203, 4, 8),
(15658055046270554203, 5, 8),
(15658055046270554203, 6, 8),
(15658055046270554203, 7, 8)]
You can use the indices to convert these matches into spans.
spans = [doc[start:end] for _, start, end in matches]
# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""
Note that I added an extra {'OP': '*'} to the pattern, which acts as a wildcard when no specific POS/DEP annotation is given (i.e. it will match any token). This is useful here because the question is about verb phrases: VERB ADV VERB is an unusual construction (try to think of some example sentences), but VERB ADV [other text] VERB is quite likely (as in the example sentence 'The author was staring pensively as she wrote'). Optionally, you can refine the pattern to make it more specific (displacy is your friend here).
Also, due to the greediness of the matcher, all permutations of a match are returned. You can optionally reduce these to the longest form with filter_spans, removing duplicates and overlaps.
from spacy.util import filter_spans
filter_spans(spans)
# output
[The author was staring pensively as she wrote]
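For intuition, the longest-span selection that filter_spans performs can be sketched in plain Python over (start, end) index pairs. This is a hypothetical helper, not spaCy's actual implementation: it prefers longer spans and discards any span overlapping one already kept.

```python
def keep_longest(spans):
    """Keep the longest non-overlapping (start, end) spans, longest first."""
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    kept, covered = [], set()
    for start, end in ordered:
        # keep a span only if none of its token positions are taken yet
        if covered.isdisjoint(range(start, end)):
            kept.append((start, end))
            covered.update(range(start, end))
    return sorted(kept)

# The (start, end) pairs from the matcher output above
matches = [(0, 4), (1, 4), (2, 4), (3, 4),
           (0, 8), (1, 8), (2, 8), (3, 8),
           (4, 8), (5, 8), (6, 8), (7, 8)]
print(keep_longest(matches))  # [(0, 8)]
```

The single surviving pair (0, 8) corresponds to the span "The author was staring pensively as she wrote", matching the filter_spans result above.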