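I have this DataFrame. When a row is duplicated, I want to keep only one copy (without summing) if the rows are exactly identical (the Mercedes case), or sum them if the rent/sale values differ (the Kia case).
Example df: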
cars rent sale
Kia 1 2
Bmw 1 4
Mercedes 2 1
Ford 1 1
Kia 4 5
Mercedes 2 1
I wrote this code:
import pandas as pd
df=pd.DataFrame({'cars':['Kia','Bmw','Mercedes','Ford','Kia','Mercedes'],
'rent':[1,1,2,1,4,2],
'sale':[2,4,1,1,5,1]})
df=df.groupby(['cars']).sum().reset_index()
print(df)
I got this output:
cars rent sale
0 Bmw 1 4
1 Ford 1 1
2 Kia 5 7
3 Mercedes 4 2
Expected output:
cars rent sale
0 Kia 5 7
1 Bmw 1 4
2 Mercedes 2 1
3 Ford 1 1
Answer 0 (score: 2)
Use DataFrame.drop_duplicates before aggregating with sum. It looks for duplicates across all columns together:
df1 = df.drop_duplicates().groupby('cars', sort=False, as_index=False).sum()
print(df1)
cars rent sale
0 Kia 5 7
1 Bmw 1 4
2 Mercedes 2 1
3 Ford 1 1
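To see why this works, it can help to look at the intermediate frame before grouping. The following is a small sketch using the question's data; the variable names `deduped` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'cars': ['Kia', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes'],
                   'rent': [1, 1, 2, 1, 4, 2],
                   'sale': [2, 4, 1, 1, 5, 1]})

# Step 1: drop rows that are identical in every column.
# Only the second Mercedes row (2, 1) is removed; the two Kia rows differ, so both stay.
deduped = df.drop_duplicates()
print(deduped)

# Step 2: sum what is left per car.
# sort=False keeps the groups in order of first appearance instead of alphabetical order.
result = deduped.groupby('cars', sort=False, as_index=False).sum()
print(result)
```

The `sort=False` argument is what gives the Kia, Bmw, Mercedes, Ford ordering from the expected output; without it the groups come back sorted by name, as in the question's own output.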
If you need to specify the columns used to check for duplicates:
df1 = (df.drop_duplicates(['cars','rent','sale'])
.groupby('cars', sort=False, as_index=False)
.sum())
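As a rough illustration of what the column list changes, the sketch below compares the default call with an explicit list of all columns and with a narrower list. The names `all_cols`, `explicit` and `by_car` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'cars': ['Kia', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes'],
                   'rent': [1, 1, 2, 1, 4, 2],
                   'sale': [2, 4, 1, 1, 5, 1]})

# Listing every column gives the same result as the default (all columns).
all_cols = df.drop_duplicates()
explicit = df.drop_duplicates(subset=['cars', 'rent', 'sale'])
print(all_cols.equals(explicit))

# A narrower subset treats rows as duplicates when only those columns match.
# Here the second Kia row is dropped too, which is NOT what the question wants.
by_car = df.drop_duplicates(subset=['cars'])
print(by_car)
```

So for this question the column list should include every column that must match for two rows to count as the same record.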
But if you need to remove duplicates separately for each column, use a lambda function with np.unique and sum:
df=pd.DataFrame({'cars':['Kia','Bmw','Mercedes','Ford','Kia','Mercedes'],
'rent':[1,1,2,1,4,2],
'sale':[2,4,1,1,5,5]})
print(df)
cars rent sale
0 Kia 1 2
1 Bmw 1 4
2 Mercedes 2 1
3 Ford 1 1
4 Kia 4 5
5 Mercedes 2 5 <- changed 5
import numpy as np
df2 = df.groupby('cars', sort=False, as_index=False).agg(lambda x: np.unique(x).sum())
print(df2)
cars rent sale
0 Kia 5 7
1 Bmw 1 4
2 Mercedes 2 6
3 Ford 1 1
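If you would rather not pull in NumPy for this, a pandas-only variant seems to give the same numbers on this data. This is just a sketch and not part of the answer above; it swaps np.unique for Series.drop_duplicates, and `df2_np` / `df2_pd` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cars': ['Kia', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes'],
                   'rent': [1, 1, 2, 1, 4, 2],
                   'sale': [2, 4, 1, 1, 5, 5]})

# Per-column deduplication with NumPy, as in the answer.
df2_np = (df.groupby('cars', sort=False)
            .agg(lambda x: np.unique(x).sum())
            .reset_index())

# Same idea using only pandas: drop repeated values within each column of each group.
df2_pd = (df.groupby('cars', sort=False)
            .agg(lambda x: x.drop_duplicates().sum())
            .reset_index())

print(df2_np)
print(df2_pd)
```

Keep in mind that either version deduplicates each column on its own, so two different rows for the same car that happen to share a rent value will have that rent counted once.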
Answer 1 (score: 0)
You can also do it this way:
df['duplicated'] = df.duplicated()  # flag each row that repeats an earlier row
df = df[~df['duplicated']]  # keep only the rows that are not flagged
df = df.drop(columns='duplicated')  # remove the helper column again
df = df.groupby(['cars'], sort=False).sum().reset_index()  # sum rent and sale per car
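For what it's worth, filtering with the boolean mask from duplicated() should select the same rows as drop_duplicates() with its defaults, so the helper column is optional. A quick check on the question's data (`mask`, `via_mask` and `via_method` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'cars': ['Kia', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes'],
                   'rent': [1, 1, 2, 1, 4, 2],
                   'sale': [2, 4, 1, 1, 5, 1]})

# True for every row that repeats an earlier row in all columns.
mask = df.duplicated()
print(mask.tolist())

# Boolean filtering and drop_duplicates() keep the same rows.
via_mask = df[~mask]
via_method = df.drop_duplicates()
print(via_mask.equals(via_method))
```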