Question

Background

I am using NeuroNER (http://neuroner.com/) to tag the text data in sample_string, shown below.

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

Output (using NeuroNER)

My output is a list of dictionaries, dic_list:

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},    
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

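Legend

id = text ID

type = type of the identified text

start = starting position of the identified text

end = ending position of the identified text

text = the identified text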
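
A quick way to confirm how start and end should be read is to slice sample_string with them. In the data above, the slice only reproduces text when end is treated as inclusive (hence the + 1). This is an observation about this sample, not a statement about NeuroNER's output format in general.

```python
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

# Python slices exclude the stop index, so an inclusive 'end' needs + 1.
recovered = [sample_string[d['start']:d['end'] + 1] for d in dic_list]
print(recovered)
```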

Goal

Since the location of each text (e.g. Jane) is given by start and end, I want to replace each text from dic_list with **PHI** in sample_string.

Desired output

sample_string = 'Patient **PHI** **PHI** was seen by Dr. **PHI** on **PHI** and her number is **PHI**'

Question

I tried "Replacing a character from a certain index" and "Edit the values in a list of dictionaries?", but they are not quite what I am looking for.

How do I achieve my desired output?

Answer 1

I might be missing something, but you could use .replace():

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

for dic in dic_list:
    sample_string = sample_string.replace(dic['text'], '**PHI**')
print(sample_string)

Although a regex might be faster:

import re
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},  
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},   
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

# re.escape keeps any regex metacharacters in the texts literal
pattern = re.compile('|'.join(re.escape(dic['text']) for dic in dic_list))
result = pattern.sub('**PHI**', sample_string)
print(result)

Both output:

Patient **PHI** **PHI** was seen by Dr. **PHI** on **PHI** and her number is **PHI**
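
One caveat with both text-based approaches: they replace every occurrence of each text, wherever it appears, and ignore start and end. The sketch below uses a made-up sentence and tag list (not from the question) to show the effect. Only the first "Smith" is tagged, yet the matching part of "Smithson" is masked as well.

```python
# Hypothetical sentence and tag list, invented purely for illustration.
text = 'Dr. Smith lives on Smithson Road'
tags = [{'start': 4, 'end': 8, 'text': 'Smith'}]

by_text = text
for tag in tags:
    # str.replace has no notion of position: it rewrites every match.
    by_text = by_text.replace(tag['text'], '**PHI**')

print(by_text)  # Dr. **PHI** lives on **PHI**son Road
```

If that matters for your data, an index-based approach is the safer choice.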

Answer 2

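If you want a solution based on the start and end indices,

you can use the gaps between the entries in dic_list to work out which parts of the string to keep, then join those parts with **PHI**.

Try this: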

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

# 'end' is inclusive, so each kept part starts at end + 1
parts_to_take = (
    [(0, dic_list[0]['start'])]
    + [(first['end'] + 1, second['start']) for first, second in zip(dic_list, dic_list[1:])]
    + [(dic_list[-1]['end'] + 1, len(sample_string))]
)
parts = [sample_string[start:end] for start, end in parts_to_take]

sample_string = '**PHI**'.join(parts)

print(sample_string)
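
The same idea can be wrapped in a small helper that does not rely on dic_list already being in order. This is only a sketch, assuming end is inclusive and the spans do not overlap. mask_spans is a hypothetical name, not part of NeuroNER.

```python
def mask_spans(text, spans, filler='**PHI**'):
    """Return text with every start..end span replaced by filler.

    Assumes 'end' is inclusive and that spans do not overlap.
    """
    pieces = []
    cursor = 0  # index of the first character not yet copied
    for span in sorted(spans, key=lambda s: s['start']):
        pieces.append(text[cursor:span['start']])  # untouched text before the span
        pieces.append(filler)
        cursor = span['end'] + 1  # skip past the inclusive end
    pieces.append(text[cursor:])  # whatever follows the last span
    return ''.join(pieces)


sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

print(mask_spans(sample_string, dic_list))
```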

Answer 3

Following the suggestion from @Error - Syntactical Remorse:

sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
 {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
 {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
 {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
 {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
 {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

offset = 0
filler = '**PHI**'
for dic in dic_list:
    sample_string = sample_string[:dic['start'] + offset ] + filler + sample_string[dic['end'] + offset + 1:]
    offset += dic['start'] - dic['end'] + len(filler) - 1
print(sample_string)
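
A variation that avoids the offset bookkeeping altogether is to apply the replacements from right to left: editing the tail of the string never shifts the indices of spans that come earlier. A minimal sketch, again assuming inclusive end values and non-overlapping spans (the name masked is just illustrative):

```python
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'

dic_list = [
    {'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
    {'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
    {'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
    {'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
    {'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]

filler = '**PHI**'
masked = sample_string
# Work from the last span to the first so earlier indices stay valid.
for dic in sorted(dic_list, key=lambda d: d['start'], reverse=True):
    masked = masked[:dic['start']] + filler + masked[dic['end'] + 1:]
print(masked)
```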
