Question

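While learning the basics of text mining I ran into the following problem: I have to use named entity annotation to find and locate named entities. Once they are found, however, the tags must be included in the document itself. So, for example, "Hello I am Koen" has to result in "Hello I am <PERSON>Koen</PERSON>".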

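I figured out how to find and label the named entities, but I am stuck on saving them into the file in the right way. I tried checking whether ent.orth_ occurs in the file and then replacing it with tag + ent.orth_ + closing tag.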

print([(X, X.ent_iob_, X.ent_type_) for X in doc])

I use this to find where the entities are and at which token each one begins.

entities = []
for ent in doc.ents:
    entities.append(ent.orth_ + ", " + ent.label_)

I use this to build a variable that holds each entity's original form together with its label.

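So now I have a variable with all the original forms and labels, and I know where each entity starts and ends. But when it comes to actually replacing them in the text, my knowledge falls short, and I cannot find any similar examples.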

Answer 1

Try this:

import spacy

nlp = spacy.load("en_core_web_sm")
s = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

def replaceSubstring(s, replacement, position, length_of_replaced):
    # Swap the slice of s that starts at `position` for `replacement`.
    s = s[:position] + replacement + s[position + length_of_replaced:]
    return s

# Walk the entities from last to first, so that inserting tags does not
# shift the character offsets of the entities still to be processed.
for ent in reversed(doc.ents):
    # print(ent.text, ent.start_char, ent.end_char, ent.label_)
    replacement = "<{}>{}</{}>".format(ent.label_, ent.text, ent.label_)
    position = ent.start_char
    length_of_replaced = ent.end_char - ent.start_char
    s = replaceSubstring(s, replacement, position, length_of_replaced)

print(s)
# <ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>
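
The same idea can also be written as a single forward pass that collects the pieces and joins them once, so no offsets need adjusting. This is only a sketch: tag_entities is a made-up helper name, and the (start, end, label) tuples are typed in by hand here to stand in for the ent.start_char, ent.end_char and ent.label_ values from the example sentence, so the snippet runs without a spaCy model. It assumes the spans do not overlap.

```python
def tag_entities(text, spans):
    """Wrap each (start, end, label) character span of text in <label>...</label>.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0  # index of the first character not yet copied to the output
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append("<{0}>{1}</{0}>".format(label, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])  # whatever follows the last entity
    return "".join(pieces)

text = "Apple is looking at buying U.K. startup for $1 billion"
# Hand-written character offsets standing in for spaCy's output.
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
tagged = tag_entities(text, spans)
print(tagged)
```

With a real document, the spans could be built as [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents], and the returned string written back to the file in the usual way.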

Incorporating tags into my file using named entity annotation
