Question

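I have a list of 20 rules for extracting trigram chunks from a sentence.

A chunk can be any trigram of POS tags:

Rule 1: [VERB, ADJ, NOUN]
Rule 2: [NOUN, VERB, ADV]
Rule 3: [NOUN, ADP, NOUN], and so on.

Example input: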

"Education of children was our revenue earning secondary business."

Desired output:

["Education of children","earning secondary business"]

I have already tried the spaCy Matcher. Because the dataset is very large, I need something more optimized than running a for loop.

Answer 1

I think you are looking for rule-based matching. Your code would look something like this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

list_of_rules = [
    ["VERB", "ADJ", "NOUN"],
    ["NOUN", "VERB", "ADV"],
    ["NOUN", "ADP", "NOUN"],
    # more rules here...
]

rules = [[{"POS": i} for i in j] for j in list_of_rules]

matcher = Matcher(nlp.vocab)
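# spaCy >= 2.2.2 and v3 take the patterns as one list;
# older v2 releases used matcher.add("rules", None, *rules)
matcher.add("rules", rules)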

doc = nlp("Education of children was our revenue earning secondary business.")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])

This prints:

['Education of children', 'earning secondary business']
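
For intuition, what these three-tag patterns select can be sketched in plain Python as a sliding window over the tag sequence with a set lookup per window. This is only an illustration, not how the spaCy Matcher is implemented. The function name match_pos_trigrams is hypothetical, and the POS tags below are assigned by hand for the example sentence rather than produced by a model, so real tagger output may differ.

```python
# Rules stored as tuples in a set so each window is checked in O(1).
RULES = {
    ("VERB", "ADJ", "NOUN"),
    ("NOUN", "VERB", "ADV"),
    ("NOUN", "ADP", "NOUN"),
}


def match_pos_trigrams(tagged, rules):
    """Return the text of every 3-token window whose POS tags form a rule.

    `tagged` is a list of (token_text, pos_tag) pairs.
    """
    chunks = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        tags = tuple(pos for _, pos in window)
        if tags in rules:
            chunks.append(" ".join(text for text, _ in window))
    return chunks


# Hand-assigned tags (an assumption for illustration, not spaCy output).
tagged = [
    ("Education", "NOUN"), ("of", "ADP"), ("children", "NOUN"),
    ("was", "AUX"), ("our", "PRON"), ("revenue", "NOUN"),
    ("earning", "VERB"), ("secondary", "ADJ"), ("business", "NOUN"),
    (".", "PUNCT"),
]

print(match_pos_trigrams(tagged, RULES))
```

Because every rule has the same length, one pass over the tokens with a set lookup covers all 20 rules at once, instead of looping over the rules for each position.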

POS tag based N-gram: spaCy
