How do I convert merged spaCy NER tags to BIO format?

Time: 2020-09-23 03:12:46

Tags: python python-3.x nlp spacy ner

How can I convert this to BIO format? I tried spaCy's biluo_tags_from_offsets, but it fails to capture all the entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

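In BSc(Bachelor of science) the two parts are joined together, but spaCy splits the text where there is a space. So the tokens now look like "BSc(Bachelor", "of", "science", which is why biluo_tags_from_offsets fails and returns "-" for them.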

Now, when it checks (80, 83, 'Degree'), it cannot find BSc as a separate token. It fails on (84, 103, 'Degree') for the same reason.

How can I handle these cases? Any help would be appreciated.


EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. · BSc(Bachelor of science) from NV, *********, *****

{'entities': [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]}

1 answer:

Answer 0 (score: 1)

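Normally you pass your data to biluo_tags_from_offsets(doc, entities), where entities looks like [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this argument as needed (for example, start from doc.ents and build on it). You can add, remove, or merge any entities in this list, as in the following example: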

import spacy
# spaCy v2 import; in spaCy v3 this function is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

entities = []
for ent in doc.ents:
    entities.append((ent.start_char, ent.end_char, ent.label_))
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

entities.append((9, 12, 'ORG'))  # add the desired entity: characters 9-12 are "BSc"

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
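The question asks for BIO, while biluo_tags_from_offsets returns BILUO. One possible way to bridge the two is a small post-processing step that maps L- to I- and U- to B-. The sketch below is only an illustration: biluo_to_bio is a hypothetical helper name, not part of spaCy, and it leaves "O" and the "-" misalignment marker untouched.

```python
def biluo_to_bio(tags):
    """Convert BILUO tags to BIO: L- becomes I-, U- becomes B-."""
    bio = []
    for tag in tags:
        if tag.startswith("L-"):
            bio.append("I-" + tag[2:])  # last token of an entity is "inside" in BIO
        elif tag.startswith("U-"):
            bio.append("B-" + tag[2:])  # single-token entity just "begins" in BIO
        else:
            bio.append(tag)  # "O", "B-...", "I-..." and "-" are passed through
    return bio


# BILUO tags copied from the output above
biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
bio_tags = biluo_to_bio(biluo)
print(bio_tags)
```

Because "-" is passed through unchanged, misaligned entities stay visible after the conversion instead of being silently turned into "O".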

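If you want the merging of entities to be rule-based, you can try EntityRuler, as in the following simplified example (taken from the spaCy documentation):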

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # spaCy v2 API; in spaCy v3 use ruler = nlp.add_pipe("entity_ruler") instead

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])

Then pass the redefined (in your case, merged) list of entities to biluo_tags_from_offsets again, as in the first snippet.
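For the specific failure in the question, where "BSc(Bachelor" is a single token and the character offsets therefore land in the middle of it, one alternative is to compute BIO tags directly from the character spans with a tokenizer that always splits off brackets. The following is a minimal spaCy-free sketch of that idea, not the method used above: char_spans_to_bio and its regex tokenizer are assumptions made for illustration, and the text is a shortened version of the second bullet from the question with offsets recomputed for it.

```python
import re


def char_spans_to_bio(text, entities):
    """Tag regex tokens with BIO labels from (start, end, label) character spans."""
    # Assumed tokenizer: runs of word characters, or single non-space symbols,
    # so "BSc(Bachelor" becomes "BSc", "(", "Bachelor".
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Only tokens lying fully inside the span are tagged;
        # tokens that merely overlap it are left as "O".
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        for n, i in enumerate(inside):
            tags[i] = ("B-" if n == 0 else "I-") + label
    return [tok for tok, _, _ in tokens], tags


text = "BSc(Bachelor of science) from NV"
entities = [(0, 3, "Degree"), (4, 23, "Degree")]
tokens, tags = char_spans_to_bio(text, entities)
print(list(zip(tokens, tags)))
```

With this tokenization both spans coincide with token boundaries, so "BSc" and "Bachelor of science" are tagged as two separate Degree entities.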