How do I remove duplicate values inside pandas cells?

Time: 2019-02-18 10:00:15

Tags: python pandas

I have a DataFrame in which the gender column contains the same value repeated within a single cell. Here is an example:

1. Male
2. Female, female
3. Female, female , Female, female 

2 answers:

Answer 0: (score: 3)

Convert the values to lowercase, split them, convert the pieces to a set, and join them back together if necessary:

df['new'] = df['col'].apply(lambda x: ', '.join(set(x.lower().split(', '))))
print (df)
                                col     new
1.0                            Male    male
2.0                  Female, female  female
3.0  Female, female, Female, female  female
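The sample in the question has stray spaces around some commas ("female , Female"), which a plain split on ', ' would not handle. A minimal sketch of a more tolerant variant follows; the helper name dedupe_cell is hypothetical, and it assumes every cell is a string:

```python
import pandas as pd

def dedupe_cell(cell: str, sep: str = ", ") -> str:
    """Lowercase, split on commas, strip whitespace, and drop repeats."""
    parts = [p.strip().lower() for p in cell.split(",")]
    # dict.fromkeys removes duplicates while keeping first-seen order,
    # so the result is deterministic (unlike iterating over a set).
    unique = list(dict.fromkeys(p for p in parts if p))
    return sep.join(unique)

df = pd.DataFrame(
    {"col": ["Male", "Female, female", "Female, female , Female, female "]}
)
df["new"] = df["col"].apply(dedupe_cell)
print(df)
```

If the column may contain missing values, filtering them out or filling them first would be needed before applying the helper.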

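To drop rows whose cell holds more than one distinct value, join the unique values with a separator such as & and keep only the rows in which that separator does not appear: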

print (df)
                              col
1.0                          Male
2.0                Female, female
3.0  Female, male, Female, female

# sorted() makes the order of the joined values deterministic
df['new'] = df['col'].apply(lambda x: '&'.join(sorted(set(x.lower().split(', ')))))
print (df)
                              col          new
1.0                          Male         male
2.0                Female, female       female
3.0  Female, male, Female, female  female&male

df = df[df['new'].str.count('&') == 0]
print (df)
                col     new
1.0            Male    male
2.0  Female, female  female
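The same filter can also be written without relying on a separator character, by counting the distinct values in each cell directly. This is only a sketch under the same assumptions (string cells, comma-separated); the names distinct_values, n_distinct, and single are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"col": ["Male", "Female, female", "Female, male, Female, female"]}
)

def distinct_values(cell: str) -> set:
    """Return the set of lowercase, whitespace-stripped values in a cell."""
    return {p.strip().lower() for p in cell.split(",") if p.strip()}

# Number of distinct values per row
n_distinct = df["col"].apply(lambda c: len(distinct_values(c)))

# Keep only rows with exactly one distinct value
single = df[n_distinct == 1].copy()
single["new"] = single["col"].apply(lambda c: next(iter(distinct_values(c))))
print(single)
```

Counting avoids surprises if a value ever contains the separator character itself.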

Answer 1: (score: 1)

You can keep just the first item of the split:

df['gender'] = df['gender'].apply(lambda x: x.split(',')[0])

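For the case where "male" and "female" both appear in the same cell, you have to choose how to handle it: drop the row, accept the first gender as correct (my solution), or set another value so the row can be identified later. But that was not what you originally asked.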
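As a rough illustration of the last option, the sketch below keeps the first value when a cell is consistent and writes a marker otherwise. The marker "mixed", the helper first_or_flag, and the column name gender_clean are all assumptions made for this example, not part of the original answer:

```python
import pandas as pd

MIXED = "mixed"  # hypothetical sentinel for cells with conflicting values

def first_or_flag(cell: str) -> str:
    """Return the single gender in a cell, or MIXED if the values disagree."""
    values = {p.strip().lower() for p in cell.split(",") if p.strip()}
    if len(values) > 1:
        return MIXED
    # All values agree, so the first one is representative
    return cell.split(",")[0].strip().lower()

df = pd.DataFrame(
    {"gender": ["Male", "Female, female", "Female, male, Female, female"]}
)
df["gender_clean"] = df["gender"].apply(first_or_flag)
print(df)
```

Rows marked this way can be reviewed or dropped later with df[df["gender_clean"] != MIXED].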