Question

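I need to write a program that identifies names in medical records. How do I substitute names that COULD include a prefix, a suffix, and a first name or initial, but do not have to have all of these every time? For example, I can get the program to identify a name with a first name or initial after the title, such as Dr. John Smith, but not plain Dr. Smith.

Thanks!

Here is my program so far: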

# This program removes names and email addresses occurring in a given input file and saves it in an output file.

import re
def deidentify():
    infilename = input("Give the input file name: ")
    outfilename = input("Give the output file name: ")

    infile = open(infilename,"r")
    text = infile.read()
    infile.close()

    # replace names
    nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
    deidentified_text = re.sub(nameRE,"**name**",text)



    outfile = open(outfilename,"w")
    print(deidentified_text, file=outfile)
    outfile.close()

deidentify()

Answer 1

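The [A-Z](\.|[a-z]+) part in

"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"

is searching for a first name or an initial. You want this part to be optional, so put it in its own capture group and follow the group with ?.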

nameRe = r"(Ms\.|Mr\.|Dr\.|Prof\.)( [A-Z](\.|[a-z]+))?( [A-Z][a-z]+)"
re.sub(nameRe, r"\1\4", text)

The ? in

( [A-Z](\.|[a-z]+))?

says "this part is optional, but even if it is empty, still count it as a capture group."

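r"\1\4" tells re.sub to build the replacement from the first and fourth capture groups, here the title and the surname (basically, a capture group starts each time you see a "(", and groups are numbered in that order).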
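To make the group numbering concrete, here is a minimal sketch that runs the pattern above on a made-up sentence (the sample text and variable names are my own, not from the question). It also shows that passing "**name**" as the replacement, as the question's program does, removes the whole name instead of keeping the title and surname.

```python
import re

# Title, then an optional first name or initial, then the surname.
name_re = r"(Ms\.|Mr\.|Dr\.|Prof\.)( [A-Z](\.|[a-z]+))?( [A-Z][a-z]+)"

# Made-up sample text, for illustration only.
sample = "Dr. John Smith saw Mr. J. Doe and Dr. Brown."

# Groups are numbered by the position of their opening parenthesis.
# When the optional part is absent, groups 2 and 3 are None.
groups = re.search(name_re, "Dr. Smith").groups()
print(groups)

# Keep the title (group 1) and the surname (group 4); drop the optional part.
shortened = re.sub(name_re, r"\1\4", sample)
print(shortened)

# Replace the whole match with a placeholder, as in the question's program.
redacted = re.sub(name_re, "**name**", sample)
print(redacted)
```

Note that r"\1\4" still leaves the surname in the text, so for de-identification the placeholder form is probably what you want.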

Answer 2

Try the following:

((?:Ms\.|Mr\.|Dr\.|Prof\.|Mrs\.) (?:[A-Z](?:\.|(?:[a-z])+) )?[A-Z][a-z]+)
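As a quick check, the pattern can be tried on a made-up sentence like this (the sample text and variable names are assumptions for illustration). Because every inner group is non-capturing, findall returns the full matched names.

```python
import re

# The pattern from this answer: one outer capture group, and the
# "first name or initial" part wrapped in an optional non-capturing group.
name_re = re.compile(
    r"((?:Ms\.|Mr\.|Dr\.|Prof\.|Mrs\.) (?:[A-Z](?:\.|(?:[a-z])+) )?[A-Z][a-z]+)"
)

# Made-up sample text, for illustration only.
sample = "Mrs. A. Jones referred Dr. Smith to Prof. Mary Lee."

# With a single capture group, findall returns that group for each match.
found = name_re.findall(sample)
print(found)

# Substitute every matched name with a placeholder.
redacted = name_re.sub("**name**", sample)
print(redacted)
```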

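However, I would suggest parsing this file into Python data structures (dictionaries, objects, and so on). Then you can simply omit the names when printing the results, not to mention all the other handy things you can do once your data is in a Python program (for example: has this patient been with us for more than five years? What percentage of patients use a credit card number as payment information?).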
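A rough sketch of that idea, assuming the records have already been parsed into dictionaries. The field names and values below are entirely hypothetical, since the question does not show the file format, and this only helps when names sit in their own field instead of in free text.

```python
# Hypothetical parsed records; the field names are illustrative only.
records = [
    {"name": "Dr. John Smith", "years_with_practice": 7, "payment": "credit card"},
    {"name": "Ms. A. Jones", "years_with_practice": 2, "payment": "cash"},
]


def without_name(record):
    """Return a copy of the record with the "name" field left out."""
    return {key: value for key, value in record.items() if key != "name"}


# De-identify by omitting the name field instead of pattern matching.
deidentified = [without_name(record) for record in records]
print(deidentified)

# Has this patient been with us for more than five years?
long_term = [r for r in deidentified if r["years_with_practice"] > 5]

# What percentage of patients pay by credit card?
card_share = 100 * sum(r["payment"] == "credit card" for r in records) / len(records)
print(len(long_term), card_share)
```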

Answer 3

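It turned out the answer was that the expression needed to use \s to account for whitespace. Once that was put in, the program worked.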
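The answer does not show the final expression, so the following is only a guess at what it might have looked like: the literal spaces are replaced with \s+, which also matches tabs, newlines, and runs of spaces. The sample text is made up.

```python
import re

# Assumed pattern (not shown in the answer): \s+ instead of a literal space,
# with the first name or initial still optional.
name_re = r"(?:Ms\.|Mr\.|Dr\.|Prof\.)(?:\s+[A-Z](?:\.|[a-z]+))?\s+[A-Z][a-z]+"

# Made-up text with a tab, a double space, and a line break inside the names.
sample = "Seen by Dr.\tSmith and Prof.  Mary\nLee today."

# A literal space does not match the tab, so this finds nothing.
literal_space_hits = re.findall(r"Dr\. [A-Z][a-z]+", sample)
print(literal_space_hits)

# The \s+ version matches both names despite the irregular whitespace.
redacted = re.sub(name_re, "**name**", sample)
print(redacted)
```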
