Background
I have a sample df in which the Text column contains zero, one, or more than one MRN:
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Smith MRN: 1111111 is this here',
'MRN: 1234567 Mary Lisa Rider found here',
'Jane A Doe is also here',
'MRN: 2222222 Tom T Tucker is here MRN: 2222222 too'],
'P_ID': [1,2,3,4],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID']]
df
Text N_ID P_ID
0 Jon J Smith MRN: 1111111 is this here A1 1
1 MRN: 1234567 Mary Lisa Rider found here A2 2
2 Jane A Doe is also here A3 3
3 MRN: 2222222 Tom T Tucker is here MRN: 2222222... A4 4
Goal
1) In the Text column, change the number that follows MRN (e.g. MRN: 1111111) to MRN: **PHI**
2) Create a new column Text_MRN that contains this output
Desired output
Text N_ID P_ID Text_MRN
0 Jon J Smith MRN: 1111111 is this here A1 1 Jon J Smith MRN: **PHI** is this here
1 MRN: 1234567 Mary Lisa Rider found here A2 2 MRN: **PHI** Mary Lisa Rider found here
2 Jane A Doe is also here A3 3 Jane A Doe is also here
3 MRN: 2222222 Tom T Tucker is here MRN: 2222222 A4 4 MRN: **PHI** Tom T Tucker is here MRN: **PHI**
Question
How do I achieve the desired output?
Answer 0 (score: 2)
If you want to replace all digits, you can do the following:
df['Text_MRN'] = df['Text'].replace(r'\d+', '***PHI***', regex=True)
However, if you want to be more specific and replace only the digits that follow MRN:, you can use this instead:
df['Text_MRN'] = df['Text'].replace(r'MRN: \d+', 'MRN: ***PHI***', regex=True)
This gives you:
df
Text P_ID N_ID Text_MRN
0 Jon J Smith MRN: 1111111 is this here 1 A1 Jon J Smith MRN: ***PHI*** is this here
1 MRN: 1234567 Mary Lisa Rider found here 2 A2 MRN: ***PHI*** Mary Lisa Rider found here
2 Jane A Doe is also here 3 A3 Jane A Doe is also here
3 MRN: 2222222 Tom T Tucker is here MRN: 2222222... 4 A4 MRN: ***PHI*** Tom T Tucker is here MRN: ***PH...
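The difference between the two patterns only shows up when the text contains digits that are not part of an MRN. The sketch below illustrates that on a small, made-up Series (the "Room 12" text is hypothetical and not from the question), and it uses the **PHI** placeholder from the question's desired output rather than the ***PHI*** used above.

```python
import pandas as pd

# Hypothetical sample; the first row has digits that are not an MRN.
s = pd.Series(['Room 12 MRN: 1111111 is this here',
               'Jane A Doe is also here',
               'MRN: 2222222 Tom T Tucker is here MRN: 2222222 too'])

# Pattern 1: every run of digits is masked, including "12".
all_digits = s.replace(r'\d+', '**PHI**', regex=True)

# Pattern 2: only digits that directly follow "MRN: " are masked.
mrn_only = s.replace(r'MRN: \d+', 'MRN: **PHI**', regex=True)

print(all_digits[0])  # Room **PHI** MRN: **PHI** is this here
print(mrn_only[0])    # Room 12 MRN: **PHI** is this here
```

Rows without an MRN pass through unchanged, and every occurrence in a row is replaced, not just the first.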
As a regular expression, \d+ means "match one or more consecutive digits", so using it in replace means "replace one or more consecutive digits with ***PHI***".
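To see what \d+ matches on its own, here is a minimal sketch using Python's built-in re module on a made-up string (the string is an assumption for illustration only). Each run of digits is matched as a whole, whatever its length.

```python
import re

# Hypothetical text with two digit runs of different lengths.
text = 'MRN: 123 and MRN: 4567890'

# \d+ grabs each full run of consecutive digits.
runs = re.findall(r'\d+', text)
print(runs)  # ['123', '4567890']

# Anchoring the pattern on the "MRN: " prefix keeps the label in the output.
masked = re.sub(r'MRN: \d+', 'MRN: **PHI**', text)
print(masked)  # MRN: **PHI** and MRN: **PHI**
```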