Background
I have the following sample df. Its Text column contains PHYSICIAN followed by the physician's name (all names below are made up).
import pandas as pd
df = pd.DataFrame({'Text' : ['PHYSICIAN: Jon J Smith was here today',
'And Mary Lisa Rider found here',
'Her PHYSICIAN: Jane A Doe is also here',
' She was seen by PHYSICIAN: Tom Tucker '],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 PHYSICIAN: Jon J Smith was here today A1 1
1 And Mary Lisa Rider found here A2 2
2 Her PHYSICIAN: Jane A Doe is also here A3 3
3 She was seen by PHYSICIAN: Tom Tucker A4 4
Goal
1) Replace the name that follows PHYSICIAN (e.g. PHYSICIAN: Jon J Smith) with **PHI** (e.g. PHYSICIAN: **PHI**)
2) Create a new column named Text_Phys
Desired Output
Text N_ID P_ID Text_Phys
0 PHYSICIAN: Jon J Smith was here today A1 1 PHYSICIAN: **PHI** was here today
1 And Mary Lisa Rider found here A2 2 And Mary Lisa Rider found here
2 Her PHYSICIAN: Jane A Doe is also here A3 3 Her PHYSICIAN: **PHI** is also here
3 She was seen by PHYSICIAN: Tom Tucker A4 4 She was seen by PHYSICIAN: **PHI**
I have tried the following
1)df['Text_Phys'] = df['Text'].replace(r'MRN.*', 'MRN: ***PHI***', regex=True)
2)df['Text_Phys'] = df['Text'].replace(r'MRN\s+', 'MRN: ***PHI***', regex=True)
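But they don't quite seem to work
Question
How do I achieve the desired output?
Answer 0 (score: 2)
Try the following: use a regular expression to define the word to match and where you want the search to stop (to automate the code further, you could generate a list of all the words that occur after the "**" placeholder). To save time, I instead hard-coded "found |was |is ".
Code below: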
import pandas as pd
df = pd.DataFrame({'Text' : ['PHYSICIAN: Jon J Smith was here today',
'And his Physician: Mary Lisa Rider found here',
'Her PHYSICIAN: Jane A Doe is also here',
' She was seen by PHYSICIAN: Tom Tucker '],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 PHYSICIAN: Jon J Smith was here today A1 1
1 And his Physician: Mary Lisa Rider found here A2 2
2 Her PHYSICIAN: Jane A Doe is also here A3 3
3 She was seen by PHYSICIAN: Tom Tucker A4 4
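import re
# Lazy match from "PHYSICIAN:" up to (but not including) a hard-coded stop word
word_before = r'PHYSICIAN:'
words_after = r'.*?(?=found |was |is )'
# Fallback: "PHYSICIAN:" plus all following word and whitespace characters
words_all = r'PHYSICIAN:[\w\s]+'
pattern = re.compile(word_before + words_after, re.IGNORECASE)
pattern2 = re.compile(words_all, re.IGNORECASE)
for i in range(len(df['Text'])):
    # Column 0 is Text; the replacement overwrites it in place
    df.iloc[i, 0] = re.sub(pattern, "PHYSICIAN: **PHI** ", df["Text"][i])
    # No stop word followed the name: fall back to the broader pattern
    if 'PHYSICIAN: **PHI**' not in df.iloc[i, 0]:
        df.iloc[i, 0] = re.sub(pattern2, "PHYSICIAN: **PHI** ", df["Text"][i])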
df
Text N_ID P_ID
0 PHYSICIAN: **PHI** was here today A1 1
1 And his PHYSICIAN: **PHI** found here A2 2
2 Her PHYSICIAN: **PHI** is also here A3 3
3 She was seen by PHYSICIAN: **PHI** A4 4
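As a possible alternative to the row-by-row loop, here is a minimal sketch that uses the vectorized Series.str.replace and writes the result to a new Text_Phys column, as the question asks. It runs on the question's original data. The stop_words list and the exact pattern are my own assumptions for illustration: the name is assumed to end at one of those words or at the end of the string, so a name followed by any other word would need the list extended.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['PHYSICIAN: Jon J Smith was here today',
             'And Mary Lisa Rider found here',
             'Her PHYSICIAN: Jane A Doe is also here',
             ' She was seen by PHYSICIAN: Tom Tucker '],
    'N_ID': ['A1', 'A2', 'A3', 'A4'],
    'P_ID': [1, 2, 3, 4],
})

# Assumption: words that may directly follow a physician's name.
# This list is hypothetical; extend it for real data.
stop_words = ['found', 'was', 'is']
stop = '|'.join(re.escape(w) for w in stop_words)

# "PHYSICIAN:" plus the name, matched lazily up to (not including)
# either a stop word or the end of the string.
pattern = rf'PHYSICIAN:\s*.*?(?=\s+(?:{stop})\b|\s*$)'

# Write to a new column so the original Text stays intact.
df['Text_Phys'] = df['Text'].str.replace(
    pattern, 'PHYSICIAN: **PHI**', regex=True, flags=re.IGNORECASE
)
print(df['Text_Phys'].tolist())
```

Rows without a PHYSICIAN: marker (row 1 in the sample) are copied through unchanged, and because the lookahead does not consume the stop word, the text after the name is preserved.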