I am using spaCy's Named Entity Recognition to find food words in a sentence. Here is my code:
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "I like to eat pizza."
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.label_)
Why doesn't it print "pizza"? According to {{3}}, food falls under the PRODUCT entity type, so shouldn't it print "pizza" for ent.text and PRODUCT for ent.label_?
Answer 0 (score: 0)
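I ran into the same problem and solved it by training spaCy on a few examples.
So grab a few sentences (even 3-4 will do) and manually extract the products from each one into a list. That gives you a dictionary mapping each text to its list of products. Then adapt this code: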
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.training import Example
from spacy.util import filter_spans

def getSpans(ner_model=None, products=(), nameForNewLabel='PRODUCTS', doc=None):
    # Build one pattern Doc per product string (tokenizer only).
    patterns = [ner_model.make_doc(product) for product in products]
    # Match the patterns against the text; overlaps are resolved further down.
    matcher = PhraseMatcher(ner_model.vocab)
    matcher.add(nameForNewLabel, patterns)  # spaCy v3 signature: add(key, patterns)
    matches = matcher(doc)
    # Create a labelled Span for each match.
    spans = []
    for match_id, start, end in matches:
        # match_id is the hash of nameForNewLabel, so it doubles as the label.
        span = Span(doc, start, end, label=match_id)
        # Debug output: matched text, character offsets, label, and whether it is a known product.
        print(span.text, span.start_char, span.end_char, span.label_, span.text in products)
        spans.append(span)
    # filter_spans removes duplicates and overlaps, since a token can belong to only one entity.
    # When spans overlap, the (first) longest span is preferred over shorter ones.
    doc.ents = filter_spans(spans)
    # Wrap the result in an Example: a fresh, unannotated Doc as the prediction side
    # and the annotated Doc as the reference.
    return Example(ner_model.make_doc(doc.text), doc)
where
doc = ner_model.make_doc(text)
and
ner_model = spacy.blank('en')  # create a blank Language class
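To make those two definitions concrete, here is a minimal sketch of what a blank pipeline and make_doc give you. The sentence is the one from the question; nothing here is part of my actual code, it only illustrates that make_doc runs the tokenizer and leaves the entities empty until you set doc.ents yourself.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
ner_model = spacy.blank("en")

# make_doc runs just the tokenizer and returns an unannotated Doc.
doc = ner_model.make_doc("I like to eat pizza.")

print(ner_model.pipe_names)         # no components have been added yet
print([token.text for token in doc])
print(len(doc.ents))                # no entities until doc.ents is assigned
```

This is why the function above has to assign doc.ents itself: the blank model knows nothing about products until it has been trained.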
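Then train the model. Once it has been trained, for example for 200 epochs with batch_size = max(number examples) (that is, one batch containing all examples), you will see that it works.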
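A rough sketch of what such a training run could look like, assuming spaCy v3. This is not my production code: the toy sentences, the make_example helper, and the fixed seed are all made up for illustration, and the example building is a compact variant of the function above. Passing the whole list to nlp.update in one call is what "batch size equal to the number of examples" amounts to.

```python
import random

import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.training import Example
from spacy.util import filter_spans, fix_random_seed

fix_random_seed(0)  # assumption: fixed seed, only to make runs repeatable

# Hypothetical hand-labelled data: each text maps to the products it mentions.
TRAIN_DATA = {
    "I like to eat pizza.": ["pizza"],
    "She ordered sushi and ramen for lunch.": ["sushi", "ramen"],
    "We shared a burger after the game.": ["burger"],
}
LABEL = "PRODUCTS"

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")  # add an untrained NER component
ner.add_label(LABEL)


def make_example(text, products):
    """Hypothetical helper: annotate the products in text and wrap them in an Example."""
    reference = nlp.make_doc(text)
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add(LABEL, [nlp.make_doc(p) for p in products])
    spans = [Span(reference, start, end, label=LABEL)
             for _, start, end in matcher(reference)]
    reference.ents = filter_spans(spans)
    return Example(nlp.make_doc(text), reference)


examples = [make_example(text, products) for text, products in TRAIN_DATA.items()]

# initialize() sets up the model weights and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

N_EPOCHS = 200
for epoch in range(N_EPOCHS):
    random.shuffle(examples)
    losses = {}
    # One update per epoch with every example in a single batch.
    nlp.update(examples, sgd=optimizer, losses=losses)

doc = nlp("I like to eat pizza.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

With only a handful of sentences the model mostly memorises them, so check it on text it has not seen before relying on it.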
I can't share my full code because I use it in a product for a private equity AI company, but with the above I'm confident you can get it working.