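I have a set of keywords that I have already matched. The real data comes from a medical context, so I have made up an equivalent scenario, at least for the parsing I am trying to do:
I have a car with chrome 1000-inch rims.
Suppose I want to return all the child words/tokens of the keyword rims as a phrase, where rims has already been tagged by spaCy as an entity.
In Python, this is what I am doing: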
# nlp is assumed to be an already-loaded spaCy pipeline whose NER knows the custom CARPART label
test_phrases = nlp("""I have a car with chrome 100-inch rims.""")
print(test_phrases.cats)
for t in test_phrases:
    print('Token: {} || POS: {} || DEP: {} CHILDREN: {} || ent_type: {}'.format(
        t, t.pos_, t.dep_, [c for c in t.children], t.ent_type_))
Token: I || POS: PRON || DEP: nsubj CHILDREN: [] || ent_type:
Token: have || POS: VERB || DEP: ROOT CHILDREN: [I, car, .] || ent_type:
Token: a || POS: DET || DEP: det CHILDREN: [] || ent_type:
Token: car || POS: NOUN || DEP: dobj CHILDREN: [a, with] || ent_type:
Token: with || POS: ADP || DEP: prep CHILDREN: [rims] || ent_type:
Token: chrome || POS: ADJ || DEP: amod CHILDREN: [] || ent_type:
Token: 100-inch || POS: NOUN || DEP: compound CHILDREN: [] || ent_type:
Token: rims || POS: NOUN || DEP: pobj CHILDREN: [chrome, 100-inch] || ent_type: CARPART
Token: . || POS: PUNCT || DEP: punct CHILDREN: [] || ent_type:
So, what I would like to use is something like this:
test_matcher = Matcher(nlp.vocab)
test_phrase = ['']
patterns = [[{'ENT':'CARPART',????}] for kp in test_phrase]
test_matcher.add('CARPHRASE', None, *patterns)
Calling test_matcher on test_doc should then return:
chrome 100-inch rims
Answer 0 (score: 0)
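I believe I have found a satisfactory solution that will work when creating a spaCy class object. You can test it to make sure it works for your case, and then add something like the following to the spaCy pipeline: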
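from spacy.matcher import Matcher

# nlp is assumed to be the already-loaded spaCy pipeline from the question
test_matcher = Matcher(nlp.vocab)
keyword_list = ['rims']
patterns = [[{'LOWER': kw}] for kw in keyword_list]
# spaCy v2 signature; in spaCy v3 this is test_matcher.add('TESTPHRASE', patterns)
test_matcher.add('TESTPHRASE', None, *patterns)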
def add_children_matches(doc, keyword_matcher):
    '''Add children to match on original single-token keyword.'''
    matches = keyword_matcher(doc)
    for match_id, start, end in matches:
        tokens = doc[start:end]
        print('keyword:', tokens)
        # Since we are getting children for keyword, there should only be one token
        if len(tokens) != 1:
            print('Skipping {}. Too many tokens to match.'.format(tokens))
            continue
        keyword_token = tokens[0]
        sorted_children = sorted([c.i for c in keyword_token.children] + [keyword_token.i], reverse=False)
        print('keyphrase:', doc[min(sorted_children):max(sorted_children)+1])

doc = nlp("""I have a car with chrome 1000-inch rims.""")
add_children_matches(doc, test_matcher)
This gives:
keyword: rims
keyphrase: chrome 1000-inch rims
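The index arithmetic in add_children_matches can be checked without loading a model. The following is a minimal sketch, not spaCy code: it hard-codes the tokens and CHILDREN lists from the parse printed in the question (so it uses the 100-inch wording from that output), and the names TOKENS, CHILDREN and keyphrase_for are made up for illustration.

```python
# Plain-Python illustration of the span arithmetic; no spaCy involved.
# TOKENS and CHILDREN are copied from the parse output shown in the question.
TOKENS = ["I", "have", "a", "car", "with", "chrome", "100-inch", "rims", "."]

# Maps a token index to the indices of its direct children.
CHILDREN = {
    1: [0, 3, 8],   # have -> I, car, .
    3: [2, 4],      # car  -> a, with
    4: [7],         # with -> rims
    7: [5, 6],      # rims -> chrome, 100-inch
}


def keyphrase_for(keyword_index):
    """Return the contiguous tokens from the leftmost to the rightmost
    of the keyword and its direct children (hypothetical helper)."""
    indices = CHILDREN.get(keyword_index, []) + [keyword_index]
    return TOKENS[min(indices):max(indices) + 1]


print(" ".join(keyphrase_for(TOKENS.index("rims"))))
```

Only direct children are considered, and the slice runs from the smallest to the largest index. For car the same helper yields "a car with": rims is a grandchild and is left out, and any token that happens to sit between the outermost children would be included even if it is not a child itself.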
Edit: To fully answer my own question, you would have to use something like the following:
def add_children_matches(doc, keyword_matcher):
    '''Add children to match on original single-token keyword.'''
    matches = keyword_matcher(doc)
    spans = []
    for match_id, start, end in matches:
        tokens = doc[start:end]
        print('keyword:', tokens)
        # Since we are getting children for keyword, there should only be one token
        if len(tokens) != 1:
            print('Skipping {}. Too many tokens to match.'.format(tokens))
            continue
        keyword_token = tokens[0]
        sorted_children = sorted([c.i for c in keyword_token.children] + [keyword_token.i], reverse=False)
        keyphrase = doc[min(sorted_children):max(sorted_children)+1]
        print('keyphrase:', keyphrase)
        # Re-create the keyphrase as a labelled span from its character offsets
        span = doc.char_span(keyphrase.start_char, keyphrase.end_char, label='CARPHRASE')
        if span is not None:
            spans.append(span)
    # Note: spans is collected here but never attached to doc; doc is returned unchanged
    return doc
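The function above collects the CARPHRASE spans but returns doc without attaching them. If the goal is to store them in doc.ents (an assumption, the snippet does not say), note that doc.ents does not accept overlapping spans, and the new phrase overlaps the existing CARPART entity on rims. The sketch below is a plain-Python illustration of a longest-span-first rule on (start, end, label) tuples; pick_longest is a made-up helper name, and the token offsets assume the tokenization printed in the question.

```python
# Plain-Python illustration of resolving overlapping spans, longest first.
# Spans are (start, end, label) tuples with an exclusive end; no spaCy involved.

def pick_longest(spans):
    """Keep the longest spans that do not overlap (hypothetical helper)."""
    # Longest first; ties go to the span that starts earlier.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end, label in ordered:
        if taken.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            taken.update(range(start, end))
    return sorted(kept)


existing = [(7, 8, "CARPART")]      # rims, as tagged in the question
collected = [(5, 8, "CARPHRASE")]   # chrome 1000-inch rims
print(pick_longest(existing + collected))
```

In spaCy itself, spacy.util.filter_spans applies a similar longest-first preference to Span objects, so something like doc.ents = filter_spans(list(doc.ents) + spans) before return doc may be what is needed; treat that as a suggestion to test rather than as part of the original answer.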