Question

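I am working on a dataset that I have to preprocess. I want to replace every occurrence of an entity (given by its start and end index) with the entity's unique ID.

Given a string such as: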

s = "The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM)."

and a collection of dictionaries keyed by ID, such as:

{
'D006973': [{'length': '12', 'offset': '199', 'text': ['hypertensive'], 'type': 'Disease'}],
'D008750': [{'length': '16', 'offset': '36', 'text': ['alpha-methyldopa'], 'type': 'Chemical'}],
'D007022': [{'length': '11', 'offset': '4', 'text': ['hypotensive'], 'type': 'Disease'}],
'D009270': [{'length': '8', 'offset': '84', 'text': ['naloxone'], 'type': 'Chemical'}, {'length': '8', 'offset': '94', 'text': ['Naloxone'], 'type': 'Chemical'}, {'length': '13', 'offset': '293', 'text': ["[3H]-naloxone"], 'type': 'Chemical'}]
}

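I want to replace every occurrence given by the offsets with its respective ID. So for the last dictionary, I want all the values in the list to be replaced with 'D009270'.

Example 1: For the first dictionary, with key 'D006973', I want to replace 'hypertensive', which occurs at index 199 and has length 12, with 'D006973'.

Example 2: For the last dictionary, with key 'D009270', I want to replace the substrings at these indices (given as tuples):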

[(84, 92), (94, 102), (293, 306)]

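In the last sentence, naloxone also occurs as part of "naloxone-suppressible", but I do not want to replace that occurrence, so I cannot simply use str.replace().
I tried replacing the string from the start index to the end index (e.g. 199 to 211 for 'hypertensive') with its unique ID, but that shifts the offsets of the other entities that have not been replaced yet. I can use padding when the replacement text ('D006973') is shorter than the string it replaces ('hypertensive'), but this fails when the replacement text is longer.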

Answer 1

You can use a string formatter together with a placeholder character:

from operator import itemgetter

s = "The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone. Naloxone alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously hypertensive rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of [3H]-naloxone (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM)."

dictionary={
'D006973': [{'length': '12', 'offset': '199', 'text': ['hypertensive'], 'type': 'Disease'}],
'D008750': [{'length': '16', 'offset': '36', 'text': ['alpha-methyldopa'], 'type': 'Chemical'}],
'D007022': [{'length': '11', 'offset': '4', 'text': ['hypotensive'], 'type': 'Disease'}],
'D009270': [{'length': '8', 'offset': '84', 'text': ['naloxone'], 'type': 'Chemical'}, {'length': '8', 'offset': '94', 'text': ['Naloxone'], 'type': 'Chemical'}, {'length': '13', 'offset': '293', 'text': ["[3H]-naloxone"], 'type': 'Chemical'}]
}

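# Collect (start, end, ID) for every annotated mention.
index_list = []
for key in dictionary:
    for dic in dictionary[key]:
        o = int(dic['offset'])
        index_list.append((o, o + int(dic['length']), key))

index_list.sort(key=itemgetter(0))

# Overwrite each span with "{}" padded by "@" to the original length,
# so the offsets of later spans stay valid.
# Assumes s contains no "@", "{" or "}" and every span is at least 2 characters long.
format_list = []
lt = list(s)
for si, ei, key in index_list:
    lt[si:ei] = list("{}") + ["@"] * ((ei - si) - 2)
    format_list.append(key)

# Drop the padding, then fill the "{}" placeholders with the IDs in order.
text = "".join(lt)
text = text.replace("@", "")
text = text.format(*format_list)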

Result: 'The D007022 effect of 100 mg/kg D008750 was also partially reversed by D009270. D009270 alone did not affect either blood pressure or heart rate. In brain membranes from spontaneously D006973 rats clonidine, 10(-8) to 10(-5) M, did not influence stereoselective binding of D009270 (8 nM), and naloxone, 10(-8) to 10(-4) M, did not influence naloxone-suppressible binding of [3H]-dihydroergocryptine (1 nM).'
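If the text might itself contain "@", "{" or "}", a variant that avoids placeholders altogether may be safer. The following is only a sketch of a different technique (slicing the text between spans and joining the pieces), not the method above. The helper names replace_spans and spans_from_annotations are made up for illustration, and the sketch assumes the annotated spans do not overlap.

```python
def spans_from_annotations(annotations):
    """Turn {id: [{'offset': ..., 'length': ...}, ...]} into (start, end, id) tuples."""
    spans = []
    for entity_id, mentions in annotations.items():
        for mention in mentions:
            start = int(mention["offset"])
            spans.append((start, start + int(mention["length"]), entity_id))
    return spans


def replace_spans(text, spans):
    """Replace each (start, end, replacement) span, scanning left to right.

    The original text is only sliced, never modified, so the offsets stay
    valid no matter how long each replacement is.
    """
    pieces = []
    cursor = 0
    for start, end, replacement in sorted(spans):
        if start < cursor:
            raise ValueError(f"overlapping span at offset {start}")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(replacement)         # the ID instead of the mention
        cursor = end
    pieces.append(text[cursor:])           # untouched tail
    return "".join(pieces)


# Shortened sample taken from the first sentence of the question's text.
sample = "The hypotensive effect of 100 mg/kg alpha-methyldopa was also partially reversed by naloxone."
sample_annotations = {
    "D008750": [{"length": "16", "offset": "36"}],
    "D007022": [{"length": "11", "offset": "4"}],
    "D009270": [{"length": "8", "offset": "84"}],
}
result = replace_spans(sample, spans_from_annotations(sample_annotations))
print(result)
```

Because every replacement is appended to a fresh list, it does not matter whether an ID is shorter or longer than the mention it replaces.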
