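I have a messy string variable containing stage information, and I want to derive a tidier string from it with fewer groups. The current DataFrame looks like this: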
cohort = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX', 'NA']},index=[1, 2, 3, 4, 5])
My ideal variable has 3 levels: Stage I-III, Stage IV, and Unknown:
cohort2 = pd.DataFrame({'stage_group': ['XXX Stage I', 'Stage II XXX', 'Stage III XXX', 'XX Stage IV XXX', 'NA'], 'stage': ['Stage I-III', 'Stage I-III', 'Stage I-III', 'Stage IV', 'Unknown']}, index=[1, 2, 3, 4, 5])
I tried the following code, but it does not assign the groups correctly (I only get Stage I-III and Unknown). Any suggestions would be helpful.
import numpy as np  # pd.np was removed in pandas 2.0, so use numpy directly
searchfor = ['Stage I', 'Stage II', 'Stage III']
cohort['stage'] = np.where(cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III",
                  np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV", "Unknown"))
Answer 0 (score: 1)
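The code works for me if I change the order. The string Stage IV also contains Stage I, so the Stage I pattern matches Stage IV rows as well. You therefore have to check for Stage IV before checking for Stage I.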
import numpy as np
import pandas as pd

data = {'stage_group': '''XXX Stage I
Stage II XXX
Stage III XXX
XX Stage IV XXX
NA'''.split('\n')
}
cohort = pd.DataFrame(data)
print(cohort)

searchfor = ['Stage I', 'Stage II', 'Stage III']
# Check 'Stage IV' first: the pattern 'Stage I' would also match 'Stage IV'.
cohort['stage'] = np.where(cohort.stage_group.str.contains('Stage IV'), "Stage IV",
                  np.where(cohort.stage_group.str.contains('|'.join(searchfor)), "Stage I-III", "Unknown"))
print(cohort)

Result:

       stage_group
0      XXX Stage I
1     Stage II XXX
2    Stage III XXX
3  XX Stage IV XXX
4               NA
       stage_group        stage
0      XXX Stage I  Stage I-III
1     Stage II XXX  Stage I-III
2    Stage III XXX  Stage I-III
3  XX Stage IV XXX     Stage IV
4               NA      Unknown
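
As an alternative to reordering the checks, the patterns themselves can be made non-overlapping. The sketch below is not part of the original answer. It assumes the stage numeral is always followed by a non-word character or the end of the string, so a regex word boundary (\b) can tell Stage I apart from Stage IV. The variable names overlap, is_early and is_late are made up for illustration, and np.select is swapped in for the nested np.where calls.

```python
import numpy as np
import pandas as pd

cohort = pd.DataFrame({
    "stage_group": ["XXX Stage I", "Stage II XXX", "Stage III XXX",
                    "XX Stage IV XXX", "NA"]
})

# The plain pattern 'Stage I' is a prefix of 'Stage IV', so it matches that row too.
overlap = cohort["stage_group"].str.contains("Stage I")
print(overlap.tolist())

# \b demands a word boundary right after the numeral, so 'Stage I' cannot
# match inside 'Stage IV' and the order of the conditions stops mattering.
is_early = cohort["stage_group"].str.contains(r"Stage (?:I|II|III)\b", regex=True)
is_late = cohort["stage_group"].str.contains(r"Stage IV\b", regex=True)

# np.select picks the label of the first condition that is True, else the default.
cohort["stage"] = np.select([is_early, is_late],
                            ["Stage I-III", "Stage IV"],
                            default="Unknown")
print(cohort)
```

In this sample the missing value is the literal string 'NA'. If the column held real NaN values instead, passing na=False to str.contains would keep the masks strictly boolean.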