Replacing a string in a pandas column

Time: 2019-07-15 01:17:36

Tags: python python-3.x string pandas text

Background

I have the following sample df. Its Text column contains the word PHYSICIAN followed by a physician's name (all names below are made up).

import pandas as pd
df = pd.DataFrame({'Text': ['PHYSICIAN: Jon J Smith was here today',
                            'And Mary Lisa Rider found here',
                            'Her PHYSICIAN: Jane A Doe is also here',
                            ' She was seen by  PHYSICIAN: Tom Tucker '],
                   'P_ID': [1, 2, 3, 4],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})

#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df

                                     Text         N_ID  P_ID
0   PHYSICIAN: Jon J Smith was here today           A1  1
1   And Mary Lisa Rider found here                  A2  2
2   Her PHYSICIAN: Jane A Doe is also here          A3  3
3   She was seen by PHYSICIAN: Tom Tucker           A4  4

Goal

1) Replace the name that follows PHYSICIAN: with **PHI** (e.g. PHYSICIAN: Jon J Smith becomes PHYSICIAN: **PHI**)

2) Store the result in a new column named Text_Phys

Desired output

                                  Text            N_ID P_ID  Text_Phys
0   PHYSICIAN: Jon J Smith was here today           A1  1   PHYSICIAN: **PHI** was here today
1   And Mary Lisa Rider found here                  A2  2   And Mary Lisa Rider found here
2   Her PHYSICIAN: Jane A Doe is also here          A3  3   Her PHYSICIAN: **PHI** is also here
3   She was seen by PHYSICIAN: Tom Tucker           A4  4   She was seen by PHYSICIAN: **PHI**

I have tried the following:

1)df['Text_Phys'] = df['Text'].replace(r'MRN.*', 'MRN: ***PHI***', regex=True)

2)df['Text_Phys'] = df['Text'].replace(r'MRN\s+', 'MRN: ***PHI***', regex=True)

But neither of them seems to work.

Question

How do I achieve the desired output?

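1 answer:

Answer 0: (score: 2)

Try the following: use a regular expression to define the word to match and the point where the match should stop. (You could generate a list of all the words that occur after the "**" to automate the code further.) To save time, I simply hard-coded "found|was|is" instead.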
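
To see what a lazy match followed by a lookahead does on a single string, here is a minimal sketch that uses only Python's re module. The leading \s and the \b word boundary in the lookahead, and the helper name mask, are assumptions added for illustration and are not part of the answer's code. "Chris Lee" is a made-up name chosen to show how a plain "is " stop word can cut a name short.

```python
import re

# The answer's stop-word idea: match lazily from "PHYSICIAN:" up to a stop word.
naive = re.compile(r"PHYSICIAN:.*?(?=found |was |is )", re.IGNORECASE)

# Assumed refinement: require whitespace before the stop word and a word
# boundary after it, so "is" inside a name such as "Chris" is not a stop word.
strict = re.compile(r"PHYSICIAN:.*?(?=\s(?:found|was|is)\b)", re.IGNORECASE)

def mask(text):
    """Replace the name after PHYSICIAN: with the **PHI** placeholder."""
    return strict.sub("PHYSICIAN: **PHI**", text)

print(mask("PHYSICIAN: Jon J Smith was here today"))
# PHYSICIAN: **PHI** was here today

# The naive pattern stops inside "Chris" because "is " appears there.
print(naive.sub("PHYSICIAN: **PHI** ", "PHYSICIAN: Chris Lee was here"))
# PHYSICIAN: **PHI** is Lee was here

print(mask("PHYSICIAN: Chris Lee was here"))
# PHYSICIAN: **PHI** was here
```

Neither pattern matches when no stop word follows the name, which is why a second, fallback pattern is still needed for rows that end with the name.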

The full code is below:

import pandas as pd
df = pd.DataFrame({'Text': ['PHYSICIAN: Jon J Smith was here today',
                            'And his Physician: Mary Lisa Rider found here',
                            'Her PHYSICIAN: Jane A Doe is also here',
                            ' She was seen by  PHYSICIAN: Tom Tucker '],
                   'P_ID': [1, 2, 3, 4],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})

df = df[['Text','N_ID', 'P_ID']]
df
    Text                                            N_ID  P_ID
0   PHYSICIAN: Jon J Smith was here today           A1    1
1   And his Physician: Mary Lisa Rider found here   A2    2
2   Her PHYSICIAN: Jane A Doe is also here          A3    3
3   She was seen by PHYSICIAN: Tom Tucker           A4    4

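import re

# Text that precedes the name, and a lazy match that stops just before a stop word
word_before = r'PHYSICIAN:'
words_after = r'.*?(?=found |was |is )'
# Fallback for rows where no stop word follows the name
words_all = r'PHYSICIAN:[\w\s]+'

pattern = re.compile(word_before + words_after, re.IGNORECASE)
pattern2 = re.compile(words_all, re.IGNORECASE)

for i in range(len(df['Text'])):
    # Column 0 is Text; first try the stop-word pattern
    df.iloc[i, 0] = pattern.sub("PHYSICIAN: **PHI** ", df.iloc[i, 0])
    # If nothing was masked, fall back to replacing the rest of the text
    if 'PHYSICIAN: **PHI**' not in df.iloc[i, 0]:
        df.iloc[i, 0] = pattern2.sub("PHYSICIAN: **PHI** ", df.iloc[i, 0])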

df
    Text                                    N_ID  P_ID
0   PHYSICIAN: **PHI** was here today       A1    1
1   And his PHYSICIAN: **PHI** found here   A2    2
2   Her PHYSICIAN: **PHI** is also here     A3    3
3   She was seen by PHYSICIAN: **PHI**      A4    4
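
The loop above overwrites the Text column, whereas the question asks for a separate Text_Phys column. One possible vectorized variant, sketched here with pandas' Series.str.replace on the question's original data, is shown below. It assumes the same hard-coded stop words and additionally assumes that a name with no stop word after it runs to the end of the text; the variable name name_pattern is made up for this example.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['PHYSICIAN: Jon J Smith was here today',
             'And Mary Lisa Rider found here',
             'Her PHYSICIAN: Jane A Doe is also here',
             ' She was seen by  PHYSICIAN: Tom Tucker '],
    'N_ID': ['A1', 'A2', 'A3', 'A4'],
    'P_ID': [1, 2, 3, 4],
})

# Assumption: the name ends right before a stop word, or at the end of the
# text (ignoring trailing whitespace) when no stop word follows.
name_pattern = r'PHYSICIAN:.*?(?=\s+(?:found|was|is)\b|\s*$)'

# Write the masked text to a new column and leave Text untouched.
df['Text_Phys'] = df['Text'].str.replace(
    name_pattern, 'PHYSICIAN: **PHI**', regex=True, flags=re.IGNORECASE
)

print(df[['Text', 'Text_Phys']])
```

Rows without "PHYSICIAN:" are left unchanged, and surrounding whitespace in the original text is preserved; chain .str.strip() if the padded spaces in the last row are unwanted.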