I have two DataFrames with the columns Name, Age, and Interest.
df1:
Name Age Interest
0 ramesh 1 rugby
1 dhoni 5 coco
2 vir 14 cricket
3 vir 14 cricket
4 vir 14 cricket
5 lee 2 cricket
6 lee 2 cricket
df2:
Name Age Interest
0 abd 3 coco
1 vir 14 cricket
2 vir 14 cricket
3 vir 14 cricket
4 vir 14 cricket
5 vir 14 cricket
6 lee 2 cricket
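Both contain duplicate rows. I want to build another DataFrame by concatenating df1 and df2 and removing the duplicates, but the surplus duplicate records should still appear in the result. If df1 has 3 identical rows and df2 has 5 of the same row, then 2 copies of that row should appear in the result DataFrame. It should not remove all of the duplicates.
Expected output (result_df):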
Name Age Interest
0 ramesh 1 rugby
1 dhoni 5 coco
2 lee 2 cricket
3 abd 3 coco
4 vir 14 cricket
5 vir 14 cricket
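(The order in which the rows appear in the result does not matter.)
I tried drop_duplicates, but it removes all of the duplicated rows, and the "keep" parameter can only retain the first or the last duplicate. What should I do?
Sample code that removes all duplicates: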
import pandas as pd
data1 = [['ramesh', 1, 'rugby'], ['dhoni', 5, 'coco'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['lee', 2, 'cricket'], ['lee', 2, 'cricket']]
df1 = pd.DataFrame(data1, columns=['Name', 'Age', 'Interest'])
data2 = [['abd', 3, 'coco'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['lee', 2, 'cricket']]
df2 = pd.DataFrame(data2, columns=['Name', 'Age', 'Interest'])
print(df1)
print(df2)
list_df = [df1, df2]
df_concat = pd.concat(list_df)
# keep=False drops every row that has a duplicate, including the surplus copies I want to keep
result_df = df_concat.drop_duplicates(keep=False)
# keep='first' or keep='last' doesn't help either: each leaves exactly one copy per duplicated row
print(result_df)
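For reference, the rule described above (keep as many copies of a row as the difference between its counts in df1 and df2) could be sketched roughly as follows. This is only one possible approach, not a built-in drop_duplicates option: it counts each distinct row per frame with groupby, takes the absolute difference of the counts, and repeats each row that many times. The names counts1, counts2, and surplus are made up for illustration, and the sketch assumes the columns contain no missing values, since groupby skips NaN keys by default.

```python
import pandas as pd

cols = ['Name', 'Age', 'Interest']
df1 = pd.DataFrame(
    [['ramesh', 1, 'rugby'], ['dhoni', 5, 'coco'],
     ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'],
     ['lee', 2, 'cricket'], ['lee', 2, 'cricket']],
    columns=cols,
)
df2 = pd.DataFrame(
    [['abd', 3, 'coco'],
     ['vir', 14, 'cricket'], ['vir', 14, 'cricket'], ['vir', 14, 'cricket'],
     ['vir', 14, 'cricket'], ['vir', 14, 'cricket'],
     ['lee', 2, 'cricket']],
    columns=cols,
)

# Count how many times each distinct row occurs in each frame.
counts1 = df1.groupby(cols).size()
counts2 = df2.groupby(cols).size()

# A row missing from one frame counts as 0 there (fill_value=0);
# the absolute difference is the number of surplus copies to keep.
surplus = counts1.sub(counts2, fill_value=0).abs().astype(int)
surplus = surplus[surplus > 0].reset_index(name='n')

# Repeat each distinct row n times and drop the helper count column.
result_df = surplus.loc[surplus.index.repeat(surplus['n']), cols].reset_index(drop=True)
print(result_df)
```

With the sample data this should give one row each for ramesh, dhoni, abd, and lee, plus two rows for vir, though the row order follows the sorted group keys rather than the order shown in the expected output.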