I have a list of 20 rules for extracting trigram chunks from a sentence.
Each chunk is a trigram of POS tags.
Example input:
"Education of children was our revenue earning secondary business."
Desired output:
["Education of children","earning secondary business"]
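I have already tried the spaCy Matcher, but the dataset is very large, so I need something more optimized than running a for loop.
Answer 0 (score: 1)
I think you are looking for rule-based matching. Your code would look something like this: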
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
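list_of_rules = [
    ["VERB", "ADJ", "NOUN"],
    ["NOUN", "VERB", "ADV"],
    ["NOUN", "ADP", "NOUN"],
    # more rules here...
]
# Turn each rule into a Matcher pattern: one {"POS": tag} dict per token
rules = [[{"POS": i} for i in j] for j in list_of_rules]
matcher = Matcher(nlp.vocab)
# spaCy v3 signature; on older spaCy v2 releases use matcher.add("rules", None, *rules)
matcher.add("rules", rules)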
doc = nlp("Education of children was our revenue earning secondary business.")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])
This prints:
['Education of children', 'earning secondary business']
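Since every rule here is a fixed-length trigram of POS tags, the same matching can also be sketched without the Matcher, as a set lookup over a sliding window of tags. This is an alternative technique, not what the Matcher does internally. The function name `match_trigrams` and the hand-assigned tags below are assumptions made for illustration; in practice the tags would come from the tagger, for example `[token.pos_ for token in doc]`.

```python
from typing import List, Sequence, Set, Tuple

# The same three illustrative rules as above, stored as tuples so they are hashable.
RULES: Set[Tuple[str, ...]] = {
    ("VERB", "ADJ", "NOUN"),
    ("NOUN", "VERB", "ADV"),
    ("NOUN", "ADP", "NOUN"),
}


def match_trigrams(
    tokens: Sequence[str],
    tags: Sequence[str],
    rules: Set[Tuple[str, ...]] = RULES,
) -> List[str]:
    """Return every three-token span whose POS tags form a trigram in `rules`."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [
        " ".join(tokens[i:i + 3])
        for i in range(len(tokens) - 2)      # slide a window of width 3
        if tuple(tags[i:i + 3]) in rules     # one set lookup per window
    ]


# Hand-assigned tags for the example sentence (an assumption, not tagger output).
tokens = ["Education", "of", "children", "was", "our", "revenue",
          "earning", "secondary", "business", "."]
tags = ["NOUN", "ADP", "NOUN", "AUX", "PRON", "NOUN",
        "VERB", "ADJ", "NOUN", "PUNCT"]

chunks = match_trigrams(tokens, tags)
print(chunks)
```

Each window costs a single set lookup, so the work per sentence does not grow with the number of rules. Tagging is still needed to produce the tags; for a large dataset, `nlp.pipe(texts)` processes texts in batches rather than one `nlp(text)` call at a time.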