Can anyone help?
I'm trying to tokenize a document with spaCy in such a way that named entities come out as single tokens. For example:
"New York is a city in the United States of America"
would be tokenized as:
['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
Any tips on how to do this are very welcome. I've looked at using span.merge(), but without success; I'm new to coding, though, so I may well have missed something.
Thanks in advance
Answer 0 (score: 1)
Update: spaCy v3 now includes the required functionality out of the box.
I'm also retracting my expectation (discussed in the answer below) that 'president-elect' should remain a single entity (though that may depend on your use case).
from pprint import pprint
import spacy
# Disable components we don't need (optional, but useful on a large dataset).
# Note: "merge_noun_chunks" needs the parser plus POS tags (tagger and
# attribute_ruler), which in turn need tok2vec, so only the lemmatizer is
# safely disabled here; disabling the parser makes merge_noun_chunks a no-op.
nlp = spacy.load("en_core_web_sm", disable=["lemmatizer"])
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")
print('Pipeline components included:', nlp.pipe_names)
example_text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect) converse in the Oval Office inside the White House in Washington, D.C."
print('Tokens: ')
pprint([
    token.text
    for token in nlp(example_text)
    # Including some of the conditions I find useful
    if not (token.is_punct or token.is_space)
])
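For the sentence from the question, this pipeline should produce something close to the requested tokenization. A quick sketch (the exact boundaries depend on the model, and merge_noun_chunks pulls articles like 'a' and 'the' into their noun chunks):
doc = nlp("New York is a city in the United States of America")
print([token.text for token in doc])
# Roughly (model-dependent):
# ['New York', 'is', 'a city', 'in', 'the United States of America']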
Read on for the original answer:
I think I've reworked the accepted answer so that it runs on spaCy v3. It appears to produce the same output, at least.
On a side note, I noticed it splits up a phrase like 'president-elect' into three tokens. I'm keeping the original answer's example sentence in the hope that someone comments with a fix (or posts a new answer); see also the sketch at the end of this answer.
import spacy
from spacy.language import Language
class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent, attrs={"LEMMA": ent.text})
        return doc

@Language.factory("entity_retokenizer_component")
def create_entity_retokenizer_component(nlp, name):
    return EntityRetokenizeComponent(nlp)

nlp = spacy.load("en_core_web_sm")  # You might want to load something else here, see docs
nlp.add_pipe("entity_retokenizer_component", name='merge_phrases', last=True)
text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect') converse in the Oval Office inside the White House in Washington, D.C."
tokened_text = [token.text for token in nlp(text)]
print(tokened_text)
# ['German',
# 'Chancellor',
# 'Angela Merkel',
# 'and',
# 'US',
# 'President',
# 'Barack Obama',
# '(',
# 'Donald Trump',
# 'as',
# 'president',
# '-',
# 'elect',
# ')',
# 'converse',
# 'in',
# 'the Oval Office',
# 'inside',
# 'the White House',
# 'in',
# 'Washington',
# ',',
# 'D.C.']
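As a possible fix for the 'president-elect' split noted above, here's a minimal, untested sketch that re-merges alpha-hyphen-alpha token triples with the retokenizer. The component name merge_hyphenated is my own, not a spaCy built-in:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.util import filter_spans

@Language.component("merge_hyphenated")
def merge_hyphenated(doc):
    # Match token triples like "president", "-", "elect"
    # (building the Matcher per call keeps the sketch short; cache it in real code)
    matcher = Matcher(doc.vocab)
    matcher.add("HYPHENATED", [[{"IS_ALPHA": True}, {"TEXT": "-"}, {"IS_ALPHA": True}]])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping matches (e.g. in "a-b-c")
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_hyphenated", last=True)
print([t.text for t in nlp("Donald Trump as president-elect")])
# Expected something like: ['Donald Trump', 'as', 'president-elect']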
Answer 1 (score: 0)
Use the doc.retokenize context manager to merge the entity spans into single tokens. Wrap this up in a custom pipeline component, and add the component to your language model.
import spacy
class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent, attrs={"LEMMA": ent.text})
        return doc

# Note: this uses the spaCy v2 API; spacy.load('en') and passing a component
# instance to add_pipe() no longer work in v3 (see the reworked answer above).
nlp = spacy.load('en')

retokenizer = EntityRetokenizeComponent(nlp)
nlp.add_pipe(retokenizer, name='merge_phrases', last=True)
doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
"converse in the Oval Office inside the White House in Washington, D.C.")
[tok for tok in doc]
#[German,
# Chancellor,
# Angela Merkel,
# and,
# US,
# President,
# Barack Obama,
# converse,
# in,
# the Oval Office,
# inside,
# the White House,
# in,
# Washington,
# ,,
# D.C.]