Question

Background

I have a sample df whose Text column contains zero, one, or more than one MRN per row:

import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Smith  MRN: 1111111 is this here', 
                                   'MRN: 1234567 Mary Lisa Rider found here', 
                                   'Jane A Doe is also here',
                                'MRN: 2222222 Tom T Tucker is here MRN: 2222222 too'], 

                      'P_ID': [1,2,3,4],
                      'N_ID' : ['A1', 'A2', 'A3', 'A4']

                     })

#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df

                            Text                      N_ID  P_ID
0   Jon J Smith MRN: 1111111 is this here               A1  1
1   MRN: 1234567 Mary Lisa Rider found here             A2  2
2   Jane A Doe is also here                             A3  3
3   MRN: 2222222 Tom T Tucker is here MRN: 2222222...   A4  4

Goal

1) Change each MRN number in the Text column (e.g. MRN: 1111111) to MRN: **PHI**

2) Create a new column, Text_MRN, that holds this output

Desired output

                             Text                  N_ID P_ID Text_MRN
0   Jon J Smith MRN: 1111111 is this here          A1   1   Jon J Smith MRN: **PHI** is this here
1   MRN: 1234567 Mary Lisa Rider found here        A2   2   MRN: **PHI** Mary Lisa Rider found here 
2   Jane A Doe is also here                        A3   3   Jane A Doe is also here 
3   MRN: 2222222 Tom T Tucker is here MRN: 2222222 A4   4   MRN: **PHI** Tom T Tucker is here MRN: **PHI**

Question

How do I achieve the desired output?

Answer 1

If you want to replace every run of digits, you can do the following:

df['Text_MRN'] = df['Text'].replace(r'\d+', '***PHI***', regex=True)

However, if you want to be more specific and replace only the digits that follow MRN:, you can use this instead:

df['Text_MRN'] = df['Text'].replace(r'MRN: \d+', 'MRN: ***PHI***', regex=True)

This gives you:

df
                                                Text  P_ID N_ID                                           Text_MRN
0             Jon J Smith  MRN: 1111111 is this here     1   A1           Jon J Smith  MRN: ***PHI*** is this here
1            MRN: 1234567 Mary Lisa Rider found here     2   A2          MRN: ***PHI*** Mary Lisa Rider found here
2                            Jane A Doe is also here     3   A3                            Jane A Doe is also here
3  MRN: 2222222 Tom T Tucker is here MRN: 2222222...     4   A4  MRN: ***PHI*** Tom T Tucker is here MRN: ***PH...
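
Note that the output above uses three asterisks (***PHI***), whereas the question asks for MRN: **PHI** with two. A minimal sketch of a variant that produces the requested form is shown below. It assumes the sample df from the question, uses Series.str.replace rather than Series.replace, and matches `\s*` instead of a single space so that "MRN:1234567" or "MRN:  1234567" would also be masked; that extra tolerance is an assumption, not something the question requires.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Text": [
        "Jon J Smith  MRN: 1111111 is this here",
        "MRN: 1234567 Mary Lisa Rider found here",
        "Jane A Doe is also here",
        "MRN: 2222222 Tom T Tucker is here MRN: 2222222 too",
    ],
    "P_ID": [1, 2, 3, 4],
    "N_ID": ["A1", "A2", "A3", "A4"],
})

# Two-asterisk mask, as written in the question's desired output.
MASK = "MRN: **PHI**"

# "MRN:" followed by optional whitespace and one or more digits.
# Every occurrence in a row is replaced, so row 3 gets two masks.
df["Text_MRN"] = df["Text"].str.replace(r"MRN:\s*\d+", MASK, regex=True)

print(df["Text_MRN"].tolist())
```

Rows without an MRN (row 2 here) pass through unchanged, because the pattern simply finds nothing to replace.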

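As a regular expression, \d+ means "match one or more consecutive digits", so using it in replace means "replace every run of one or more consecutive digits with ***PHI***".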
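
To see the difference between the two patterns outside pandas, here is a small sketch using the standard-library re module. The trailing "room 12" in the sample string is a hypothetical addition, included only to show what happens to digits that are not part of an MRN.

```python
import re

# "room 12" is a made-up suffix, not part of the question's data.
text = "MRN: 2222222 Tom T Tucker is here MRN: 2222222 too, room 12"

# \d+ alone matches every run of digits, including the unrelated "12".
all_digits = re.sub(r"\d+", "***PHI***", text)

# Requiring the literal "MRN: " prefix leaves other numbers untouched.
mrn_only = re.sub(r"MRN: \d+", "MRN: ***PHI***", text)

print(all_digits)
print(mrn_only)
```

The first call also masks the room number, which is why the prefixed pattern is usually the safer choice when the text may contain other numbers.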
