Question

I have a large DataFrame and want to drop some redundant rows.

The DataFrame looks like this:

A   B    C
foo 12   *
foo 12   z <- redundant row
foo 12   x <- redundant row
foo 15   x
bar 13   z
bar 13   x

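I want to drop every row for which another row with the same A and B values exists and has "*" in column C. The "*" row itself is kept.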

So the resulting df would be:

A   B    C
foo 12   * <- kept because it is the "foo 12" row that has the "*"
foo 15   x <- kept because there is no "foo 15 *" row
bar 13   z <- kept because there is no "bar 13 *" row
bar 13   x <- kept because there is no "bar 13 *" row

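If possible, I would like to avoid a Cartesian product / merging the df with itself, because of memory usage. (If that is not possible, splitting the df and reassembling it would be acceptable.)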

Answer 1

If I understand correctly, you want to keep only the * rows in groups that contain a *, and keep all rows otherwise:

s = df.groupby(['A', 'B'])['C'].transform(lambda c: c.eq('*').any())
df = df[df['C'].eq('*') | ~s]

df:

     A   B  C
0  foo  12  *
3  foo  15  x
4  bar  13  z
5  bar  13  x
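
To reproduce this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question, and the name `result` is my own choice, not something from the original post.

```python
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar'],
    'B': [12, 12, 12, 15, 13, 13],
    'C': ['*', 'z', 'x', 'x', 'z', 'x'],
})

# True for every row whose (A, B) group contains at least one '*'
s = df.groupby(['A', 'B'])['C'].transform(lambda c: c.eq('*').any())

# Keep the '*' rows themselves, plus every row of a group without a '*'
result = df[df['C'].eq('*') | ~s]
print(result)
```

No merge or self-join is involved, so the only extra memory is a couple of boolean Series the same length as df.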

Find the (A, B) groups that contain a * in C:

s = df.groupby(['A', 'B'])['C'].transform(lambda c: c.eq('*').any())

0     True
1     True
2     True
3    False
4    False
5    False
Name: C, dtype: bool

Find the rows where C is *:

df['C'].eq('*')

0     True
1    False
2    False
3    False
4    False
5    False
Name: C, dtype: bool

Then negate s and OR the two masks:

df['C'].eq('*') | ~s

0     True
1    False
2    False
3     True
4     True
5     True
Name: C, dtype: bool
