Question

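I have a text and a set of index entries, some of which denote important multi-word expressions (MWEs) that occur in the text (e.g. "spongy bone" for a biology text). I would like to use these entries to build a custom matcher in spaCy so that I can identify occurrences of the MWEs in the text. An additional requirement is that the matched occurrences must preserve the lemma and POS tag of each of the MWE's constituent words.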

I have looked at the existing spaCy examples that do something similar, but I can't seem to get the pattern right.

Answer 1

The spaCy documentation is not very clear about using the Matcher class with multiple phrases, but there is a multi-phrase matching example in the GitHub repo.

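I recently faced the same challenge and got it working as follows. My text file contains one record per line, each consisting of a phrase and its description separated by '::'.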
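To make the file format concrete, here is a minimal sketch of parsing such records using only the standard library. The sample entries and the `parse_rules` helper are hypothetical (the actual file is not shown here). Unlike the script below, this sketch skips blank lines and uses an empty description when the '::' separator is missing.

```python
# Hypothetical sample entries; the real file contents are not shown in this answer.
SAMPLE_LINES = [
    "spongy bone::porous inner bone tissue\n",
    "compact bone::dense outer bone tissue\n",
    "\n",  # blank lines are skipped
]


def parse_rules(lines, sep="::"):
    """Return a dict mapping each lowercased phrase to its description."""
    rules = {}
    for line in lines:
        # partition splits on the first separator only
        phrase, found, desc = line.rstrip("\n").partition(sep)
        if not phrase.strip():
            continue
        # If the separator is missing, fall back to an empty description.
        rules[phrase.strip().lower()] = desc.strip() if found else ""
    return rules


rules = parse_rules(SAMPLE_LINES)
print(rules)
```

The same function can be given an open file object instead of a list, since it only iterates over lines.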

import spacy
import io
from spacy.matcher import PhraseMatcher

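# In spaCy v3 the 'en' shortcut no longer exists; load the pipeline by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')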
text = nlp(u'Your text here')
rules = list()

# Create a list of tuple of phrase and description from the file
with io.open('textfile','r',encoding='utf8') as doc:
    rules = [tuple(line.rstrip('\n').split('::')) for line in doc]

# convert the phrase string to a spacy doc object 
rules = [(nlp(item[0].lower()),item[-1]) for item in rules ]

# create a dictionary for accessing value using the string as the index which is returned by matcher class
rules_dict = dict()
for key,val in rules:
    rules_dict[key.text]=val

# get just the phrases from rules list
rules_phrases = [item[0] for item in rules]

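# match using the PhraseMatcher class (spaCy v3 API); attr='LOWER' makes matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
for phrase in rules_phrases:
    # register each phrase under its own text so the match id maps back to rules_dict
    matcher.add(phrase.text, [phrase])
matches = matcher(text)
result = list()

# each match is a (match_id, start, end) tuple; start and end are token offsets
for match_id, start, end in matches:
    label = nlp.vocab.strings[match_id]
    span = text[start:end]  # tokens in the span keep their lemma_ and pos_
    result.append({"start":start,"end":end,"phrase":label,"desc":rules_dict[label],
                   "lemmas":[t.lemma_ for t in span],"pos":[t.pos_ for t in span]})
print(result)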
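
Regarding the requirement to keep the lemma and POS tag of each constituent word: a match is just a pair of token offsets into the processed document, so the matched tokens carry whatever annotations the pipeline assigned. Here is a minimal sketch of that idea, assuming spaCy v3. The `find_mwes` helper and the example sentence are my own illustrations, not part of spaCy or of the question.

```python
import spacy
from spacy.matcher import PhraseMatcher


def find_mwes(nlp, text, phrases):
    """Return one dict per MWE occurrence, with per-token lemma and POS."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
    # make_doc only tokenizes, which is all the matcher needs for patterns
    matcher.add("MWE", [nlp.make_doc(p) for p in phrases])
    doc = nlp(text)
    results = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]  # start and end are token offsets
        results.append({
            "start": start,
            "end": end,
            "text": span.text,
            "tokens": [(t.text, t.lemma_, t.pos_) for t in span],
        })
    return results


# A blank pipeline only tokenizes, so lemma_ and pos_ are not filled in here.
# This keeps the sketch runnable without downloading a trained pipeline.
nlp = spacy.blank("en")
hits = find_mwes(nlp, "Spongy bone is lighter than compact bone.", ["spongy bone"])
for hit in hits:
    print(hit)
```

To get actual lemmas and POS tags, pass a trained pipeline instead of the blank one, e.g. `spacy.load('en_core_web_sm')`, provided it is installed.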
