Can anyone help?
I'm trying to tokenize a document with spaCy in such a way that named entities come out as single tokens. For example:
"New York is a city in the United States of America"
would be tokenized as:
['New York', 'is', 'a', 'city', 'in', 'the', 'United States of America']
Any tips on how to do this are very welcome. I've looked at using span.merge(), but without success; I'm new to coding, though, so I may well have missed something.
Thanks in advance
Answer 0 (score: 1)
Update: spaCy v3 now includes the required functionality out of the box.
I'm also retracting my expectation (discussed in the answer below) that 'president-elect' should remain a single entity (though that may depend on your use case).
from pprint import pprint
import spacy
# Disable components we don't need (optional, but useful on a large dataset).
# Note: "merge_noun_chunks" needs the parser plus POS tags (tagger and
# attribute_ruler), which in turn need tok2vec, so only the lemmatizer is
# safely disabled here; disabling the parser makes merge_noun_chunks a no-op.
nlp = spacy.load("en_core_web_sm", disable=["lemmatizer"])
nlp.add_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")
print('Pipeline components included:', nlp.pipe_names)
example_text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect) converse in the Oval Office inside the White House in Washington, D.C."
print('Tokens: ')
pprint([
    token.text
    for token in nlp(example_text)
    # Including some of the conditions I find useful
    if not (token.is_punct or token.is_space)
])
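For the sentence from the question, this pipeline should produce something close to the requested tokenization. A quick sketch (the exact boundaries depend on the model, and merge_noun_chunks pulls articles like 'a' and 'the' into their noun chunks):
doc = nlp("New York is a city in the United States of America")
print([token.text for token in doc])
# Roughly (model-dependent):
# ['New York', 'is', 'a city', 'in', 'the United States of America']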
Read on for the original answer:
I think I've reworked the accepted answer so that it runs on spaCy v3. It appears to produce the same output, at least.
On a side note, I noticed it splits up a phrase like 'president-elect' into three tokens. I'm keeping the original answer's example sentence in the hope that someone comments with a fix (or posts a new answer); see also the sketch at the end of this answer.
import spacy
from spacy.language import Language
class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent, attrs={"LEMMA": ent.text})
        return doc

@Language.factory("entity_retokenizer_component")
def create_entity_retokenizer_component(nlp, name):
    return EntityRetokenizeComponent(nlp)

nlp = spacy.load("en_core_web_sm")  # You might want to load something else here, see docs
nlp.add_pipe("entity_retokenizer_component", name='merge_phrases', last=True)
text = "German Chancellor Angela Merkel and US President Barack Obama (Donald Trump as president-elect') converse in the Oval Office inside the White House in Washington, D.C."
tokened_text = [token.text for token in nlp(text)]
print(tokened_text)
# ['German',
# 'Chancellor',
# 'Angela Merkel',
# 'and',
# 'US',
# 'President',
# 'Barack Obama',
# '(',
# 'Donald Trump',
# 'as',
# 'president',
# '-',
# 'elect',
# ')',
# 'converse',
# 'in',
# 'the Oval Office',
# 'inside',
# 'the White House',
# 'in',
# 'Washington',
# ',',
# 'D.C.']
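As a possible fix for the 'president-elect' split noted above, here's a minimal, untested sketch that re-merges alpha-hyphen-alpha token triples with the retokenizer. The component name merge_hyphenated is my own, not a spaCy built-in:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.util import filter_spans

@Language.component("merge_hyphenated")
def merge_hyphenated(doc):
    # Match token triples like "president", "-", "elect"
    # (building the Matcher per call keeps the sketch short; cache it in real code)
    matcher = Matcher(doc.vocab)
    matcher.add("HYPHENATED", [[{"IS_ALPHA": True}, {"TEXT": "-"}, {"IS_ALPHA": True}]])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        # filter_spans drops overlapping matches (e.g. in "a-b-c")
        for span in filter_spans(spans):
            retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_hyphenated", last=True)
print([t.text for t in nlp("Donald Trump as president-elect")])
# Expected something like: ['Donald Trump', 'as', 'president-elect']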
Answer 1 (score: 0)
Use the doc.retokenize context manager to merge the entity spans into single tokens. Wrap this up in a custom pipeline component, and add the component to your language model.
import spacy
class EntityRetokenizeComponent:
    def __init__(self, nlp):
        pass

    def __call__(self, doc):
        # Merge each entity span into a single token
        with doc.retokenize() as retokenizer:
            for ent in doc.ents:
                retokenizer.merge(ent, attrs={"LEMMA": ent.text})
        return doc

# Note: this uses the spaCy v2 API; spacy.load('en') and passing a component
# instance to add_pipe() no longer work in v3 (see the reworked answer above).
nlp = spacy.load('en')

retokenizer = EntityRetokenizeComponent(nlp)
nlp.add_pipe(retokenizer, name='merge_phrases', last=True)
doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
"converse in the Oval Office inside the White House in Washington, D.C.")
[tok for tok in doc]
#[German,
# Chancellor,
# Angela Merkel,
# and,
# US,
# President,
# Barack Obama,
# converse,
# in,
# the Oval Office,
# inside,
# the White House,
# in,
# Washington,
# ,,
# D.C.]