Question

I have a data frame that looks like this:

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 4, old
101, electronic, 10, 10, new
102, hardware, 2, 1, new
103, other, 22, 13, new

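I am trying to update my data frame so that, for each product, if all other columns are the same, the resulting data frame keeps the row with type=old; otherwise it uses the values from the row with type=new.

Final output: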

prod_id, prod_name, col_1, col_2, type
101, electronic, 10, 10, old
102, hardware, 2, 1, new
103, other, 22, 13, new

Answer 1

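As I understand it, you can combine two boolean masks: one keeps the rows that have a duplicate (ignoring type) and have type 'old'; the other keeps the rows that have no duplicate and have type 'new':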

u = df.drop(columns="type")  # positional axis argument was removed in pandas 2.0
c = ((u.duplicated(keep=False) & df['type'].eq('old')) | 
     (df['type'].eq('new') & ~u.duplicated(keep=False)) )
out = df[c].copy()

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
3      102    hardware      2      1  new
4      103       other     22     13  new
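For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same two-mask idea. It rebuilds the sample frame from the question and assumes that type only ever takes the values 'old' and 'new'; the names `dup` and `is_old` are just illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "prod_id": [101, 102, 101, 102, 103],
    "prod_name": ["electronic", "hardware", "electronic", "hardware", "other"],
    "col_1": [10, 2, 10, 2, 22],
    "col_2": [10, 4, 10, 1, 13],
    "type": ["old", "old", "new", "new", "new"],
})

u = df.drop(columns="type")          # every column except type
dup = u.duplicated(keep=False)       # True where another row has the same values
is_old = df["type"].eq("old")
is_new = df["type"].eq("new")

# Keep unchanged rows as 'old', and changed or brand-new rows as 'new'
out = df[(dup & is_old) | (~dup & is_new)].copy()
print(out)
```

Computing `duplicated` once and reusing it avoids evaluating it twice, but the selection logic is the same as above.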

Answer 2

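As I see it, you want the result to contain one row per prod_id, taken from the source rows (more precisely, the last row of each group).

The value of the type column depends on all values in the col_... columns, i.e. the columns from index 2 up to, but not including, the last one.

To get this result, define the following function: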

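import numpy as np

def grpRes(grp):
    res = grp.iloc[-1, :].copy()  # last row of the group, copied so it can be modified safely
    # 'old' only if every value in columns 2..-2 of the group is identical
    res['type'] = 'old' if np.unique(grp.values[:, 2:-1]).size == 1 else 'new'
    return res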

Then apply this function to each group:

result = df.groupby('prod_id').apply(grpRes).reset_index(drop=True)

The result is:

   prod_id   prod_name  col_1  col_2 type
0      101  electronic     10     10  old
1      102    hardware      2      1  new
2      103       other     22     13  new
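One caveat: the check above pools the values of all col_... columns together, so a product whose old and new rows are identical but have col_1 != col_2 would be labelled 'new'. A hedged alternative sketch that compares each column separately is shown below. It assumes the value columns are named col_1 and col_2, that there are no missing values (`GroupBy.last` skips NaN), and that a product with a single row should be treated as 'new'; `value_cols` and `unchanged` are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "prod_id": [101, 102, 101, 102, 103],
    "prod_name": ["electronic", "hardware", "electronic", "hardware", "other"],
    "col_1": [10, 2, 10, 2, 22],
    "col_2": [10, 4, 10, 1, 13],
    "type": ["old", "old", "new", "new", "new"],
})

value_cols = ["col_1", "col_2"]      # assumed names of the compared columns
g = df.groupby("prod_id")

# A product is unchanged if it has several rows and each value column
# holds exactly one distinct value within the group
unchanged = g[value_cols].nunique().eq(1).all(axis=1) & (g.size() > 1)

# Take the last row per product, then overwrite type from the check above
result = g.last().reset_index()
result["type"] = ["old" if flag else "new" for flag in unchanged]
print(result)
```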

Answer 3

There is a simple solution that works if and only if the type = 'old' row comes first among each set of duplicate rows:

columns = list(df.columns)
columns.remove('type')
df = df.drop_duplicates(subset=columns, keep='first')
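Note that on the sample data this step alone still keeps both 102 rows, because they differ in col_2 and so are not duplicates of each other. The sketch below is one possible way to extend it, not part of the original answer: it first sorts so that 'old' rows come first (which makes the precondition hold), then keeps only the last remaining row per prod_id. It assumes prod_id identifies a product and that each product has at most one 'old' and one 'new' row.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "prod_id": [101, 102, 101, 102, 103],
    "prod_name": ["electronic", "hardware", "electronic", "hardware", "other"],
    "col_1": [10, 2, 10, 2, 22],
    "col_2": [10, 4, 10, 1, 13],
    "type": ["old", "old", "new", "new", "new"],
})

columns = [c for c in df.columns if c != "type"]

# Stable sort that puts 'old' rows before 'new' rows (False sorts before True)
ordered = df.sort_values("type", key=lambda s: s.ne("old"), kind="stable")

# Unchanged products: the 'new' copy is dropped and the 'old' row survives
deduped = ordered.drop_duplicates(subset=columns, keep="first")

# Changed products: both rows survived above, so keep only the later ('new') one
out = (deduped.drop_duplicates(subset="prod_id", keep="last")
              .sort_values("prod_id")
              .reset_index(drop=True))
print(out)
```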
