I have a DataFrame in which the gender column contains repeated values within a single cell. Here is an example:
1. Male
2. Female, female
3. Female, female , Female, female
Answer 0 (score: 3)
Convert the values to lowercase, split them, convert the result to a set, and join it back into a string if necessary:
df['new'] = df['col'].apply(lambda x: ', '.join(set(x.lower().split(', '))))
print (df)
col new
1.0 Male male
2.0 Female, female female
3.0 Female, female, Female, female female
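One caveat with the approach above: `split(', ')` only matches a comma followed by exactly one space, so a cell such as `Female, female , Female, female` (note the space before the second comma) would leave `'female '` as a separate value, and a `set` does not guarantee any order when several distinct values remain. A minimal sketch of a more tolerant helper follows; the name `dedupe_values` and the `sep` parameter are illustrative assumptions, not part of the original answer.

```python
import re

def dedupe_values(cell, sep=", "):
    """Lowercase a cell, split on commas (ignoring spaces around them),
    drop duplicates and keep the first-seen order."""
    parts = [p for p in re.split(r"\s*,\s*", cell.strip().lower()) if p]
    # dict.fromkeys removes duplicates while preserving insertion order
    return sep.join(dict.fromkeys(parts))

print(dedupe_values("Female, female , Female, female"))  # female
print(dedupe_values("Female, male, Female, female"))     # female, male
```

Assuming the column holds only strings, it could be applied the same way as the lambda: `df['new'] = df['col'].apply(dedupe_values)`.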
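A solution for removing rows whose cell still holds more than one distinct value after deduplication. Join the unique values with `&` instead of `, `, then keep only the rows where `new` contains no `&`: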
print (df)
col
1.0 Male
2.0 Female, female
3.0 Female, male, Female, female
df['new'] = df['col'].apply(lambda x: '&'.join(set(x.lower().split(', '))))
print (df)
col new
1.0 Male male
2.0 Female, female female
3.0 Female, male, Female, female female&male
df = df[df['new'].str.count('&') == 0]
print (df)
col new
1.0 Male male
2.0 Female, female female
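The same filter can also be expressed by counting the distinct values directly, which avoids relying on `&` never appearing in the data. This is only a sketch under that reading of the problem; `distinct_values` and `has_single_value` are hypothetical helper names, and a plain list stands in for the column.

```python
def distinct_values(cell):
    """Return the set of lowercase, whitespace-stripped values in a cell."""
    return {p.strip() for p in cell.lower().split(",") if p.strip()}

def has_single_value(cell):
    """True when the cell contains exactly one distinct value."""
    return len(distinct_values(cell)) == 1

rows = ["Male", "Female, female", "Female, male, Female, female"]
kept = [r for r in rows if has_single_value(r)]
print(kept)  # ['Male', 'Female, female']
```

On the DataFrame this would presumably become `df = df[df['col'].apply(has_single_value)]`, again assuming every cell is a string.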
Answer 1 (score: 1)
You can keep only the first piece of the split:
df['gender'] = df['gender'].apply(lambda x: x.split(',')[0])
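For cells that contain both "male" and "female", you have to choose: drop the row, decide that the first gender listed is good enough (my solution), or set another value so the row can be identified later. But that was not your original question.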
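The last option, setting another value for later identification, might look like the sketch below. The function name `first_or_flag` and the `"mixed"` label are assumptions chosen for illustration. Unlike the one-liner above, it also lowercases and strips each value so that `Female` and ` female` count as the same gender.

```python
def first_or_flag(cell, flag="mixed"):
    """Return the single gender in a cell, or `flag` when the cell
    holds no value or several distinct values."""
    values = list(dict.fromkeys(
        p.strip().lower() for p in cell.split(",") if p.strip()
    ))
    return values[0] if len(values) == 1 else flag

print(first_or_flag("Female, female"))                # female
print(first_or_flag("Female, male, Female, female"))  # mixed
```

Applied as `df['gender'] = df['gender'].apply(first_or_flag)`, rows labelled `mixed` can then be inspected or dropped in a separate step.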