pandas: check whether the number of True values in each group equals 1

Date: 2018-09-11 16:03:42

Tags: python-3.x pandas dataframe pandas-groupby

I have the following df:

cluster_id    dummy
1             False
1             True
1             True
2             False
2             False
3             False
3             True
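
For anyone who wants to reproduce the answers, the frame above can be rebuilt roughly like this. This is a minimal sketch: only the column names and values come from the question, and the default integer index is an assumption.

```python
import pandas as pd

# Rebuild the question's frame; the 0..6 default index is assumed.
df = pd.DataFrame({
    "cluster_id": [1, 1, 1, 2, 2, 3, 3],
    "dummy": [False, True, True, False, False, False, True],
})
print(df)
```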

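I want to create a boolean column 'dummy_display' that is set to True if there is at least one False in each cluster and the number of dummy == True rows is less than the length of the cluster, so the result should look like: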

cluster_id    dummy     dummy_display
1             False     False
1             True      False
1             True      False
2             False     True
2             False     True
3             False     False
3             True      False

2 Answers:

Answer 0 (score: 3)

Use transform together with any

In [137]: ~df.groupby('cluster_id')['dummy'].transform('any')
Out[137]:
0    False
1    False
2    False
3     True
4     True
5    False
6    False
Name: dummy, dtype: bool

In [139]: df['dummy_display'] = ~df.groupby('cluster_id')['dummy'].transform('any')

In [140]: df
Out[140]:
   cluster_id  dummy  dummy_display
0           1  False          False
1           1   True          False
2           1   True          False
3           2  False           True
4           2  False           True
5           3  False          False
6           3   True          False
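
A short sketch of why transform is used here rather than a plain aggregation: groupby(...).any() returns one value per cluster, while transform('any') broadcasts that value back onto every row, so it can be negated and assigned as a column. The frame is rebuilt from the question's data to keep the snippet self-contained; per_cluster and per_row are illustrative names, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "cluster_id": [1, 1, 1, 2, 2, 3, 3],
    "dummy": [False, True, True, False, False, False, True],
})

grouped = df.groupby("cluster_id")["dummy"]

# Aggregation: one boolean per cluster, indexed by cluster_id.
per_cluster = grouped.any()

# Transform: the same per-cluster value repeated for each row of df.
per_row = grouped.transform("any")

# A cluster is flagged only when none of its rows is True.
df["dummy_display"] = ~per_row
print(per_cluster)
print(df)
```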

Answer 1 (score: 2)

IMO...

@Zero's answer is simpler and should be the go-to method. But I can't resist offering a NumPy alternative.

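import numpy as np
import pandas as pd

i, u = pd.factorize(df.cluster_id)   # i: integer code per row, u: unique ids
a = np.zeros(len(u), bool)           # np.bool8 was removed in NumPy 2.0
np.logical_or.at(a, i, df.dummy.values)

# a[i] is the per-cluster `any`; negate it (~a[i]) for the column in the question
df.assign(dummy_display=a[i])

   cluster_id  dummy  dummy_display
0           1  False           True
1           1   True           True
2           1   True           True
3           2  False          False
4           2  False          False
5           3  False           True
6           3   True           True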

Breakdown

pandas.factorize creates an array of integers, each of which identifies one of the unique values in df.cluster_id

i, u = pd.factorize(df.cluster_id)
print(f"factorization (i): {[*i]}\nunique values (u): {[*u]}")

factorization (i): [0, 0, 0, 1, 1, 2, 2]
unique values (u): [1, 2, 3]
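
In the example above the cluster ids happen to be sorted already, so the codes look like "id minus one". As a hedged illustration with made-up string labels (not from the question), factorize numbers values in order of first appearance unless sort=True is passed:

```python
import pandas as pd

# Hypothetical labels, chosen so that first-appearance order differs from sorted order.
labels = pd.Series(["b", "a", "b", "c"])

codes, uniques = pd.factorize(labels)
print(list(codes), list(uniques))

# With sort=True the uniques are sorted and the codes are remapped to match.
sorted_codes, sorted_uniques = pd.factorize(labels, sort=True)
print(list(sorted_codes), list(sorted_uniques))
```

Either way, uniques[codes] reproduces the original labels, which is what makes the a[i] broadcast in this answer work.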

Then we initialize a False for each unique cluster_id

a = np.zeros(len(u), bool)
print(f"accumulated `or` init (a): {[*a]}")

accumulated `or` init (a): [False, False, False]

Then we use the np.logical_or.at function to accumulate a logical or of the boolean values at the specified indices

np.logical_or.at(a, i, df.dummy.values)
print(f"accumulated `or` post (a): {[*a]}")
print(f"broadcast over factorization (a[i]):\n  {[*a[i]]}")

accumulated `or` post (a): [True, False, True]
broadcast over factorization (a[i]):
  [True, True, True, False, False, True, True]
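
The reason for the .at form is that ufunc.at is unbuffered: when an index repeats, every occurrence is applied, whereas an in-place operation through fancy indexing applies only one update per index. A minimal sketch with made-up indices and values (idx, vals, acc, counts and buffered are illustrative names):

```python
import numpy as np

idx = np.array([0, 0, 1])             # index 0 appears twice
vals = np.array([True, False, False])

# Unbuffered logical or: slot 0 sees both True and False, slot 1 only False.
acc = np.zeros(2, dtype=bool)
np.logical_or.at(acc, idx, vals)

# The difference is easiest to see with addition.
counts = np.zeros(2, dtype=int)
np.add.at(counts, idx, 1)             # every occurrence is counted

buffered = np.zeros(2, dtype=int)
buffered[idx] += 1                    # repeated index 0 is incremented only once

print(acc, counts, buffered)
```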

Let's dig a little deeper. I'll iterate through the rows and show how the group accumulator a changes

a = [False, False, False]
print(f"accumulate `or` init (a): {a}", end='\n\n')

d = df.assign(i=i, a=None)[['cluster_id', 'i', 'dummy', 'a']]

for j in d.index:
  a[d.at[j, 'i']] |= d.at[j, 'dummy']
  d.at[j, 'a'] = [*a]

d


   cluster_id  i  dummy                              a
            at ↓     ⇩  or a[0]          ⇩
0           1  0  False              [False, False, False]
                             ╭──────────⤴
            at ↓     ⇩  or a[0] ==       ⇩
1           1  0   True               [True, False, False]
                             ╭──────────⤴
            at ↓     ⇩  or a[0] ==       ⇩
2           1  0   True               [True, False, False]
                             ╭─────────────────⤴
            at ↓     ⇩  or a[1] ==              ⇩
3           2  1  False               [True, False, False]
                             ╭─────────────────⤴
            at ↓     ⇩  or a[1] ==              ⇩
4           2  1  False               [True, False, False]
                             ╭────────────────────────⤴
            at ↓     ⇩  or a[2] ==                     ⇩
5           3  2  False               [True, False, False]
                             ╭────────────────────────⤴
            at ↓     ⇩  or a[2] ==                     ⇩
6           3  2   True                [True, False, True]

And the same broadcast as shown above

print(f"result (a): {a}\nbroadcasted (a[i]):\n  {[a[j] for j in i]}")

result (a): [True, False, True]
broadcasted (a[i]):
  [True, True, True, False, False, True, True]
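
Putting the pieces together, here is a sketch that wraps the NumPy steps in a helper and checks them against the groupby version. The name group_any is hypothetical, and the final negation is an assumption based on the expected output in the question (True only for the cluster with no True rows).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cluster_id": [1, 1, 1, 2, 2, 3, 3],
    "dummy": [False, True, True, False, False, False, True],
})

def group_any(keys, flags):
    """Return, for each row, whether any flag in that row's key group is True."""
    codes, uniques = pd.factorize(keys)
    acc = np.zeros(len(uniques), dtype=bool)
    # Unbuffered accumulation of `or` into one slot per group.
    np.logical_or.at(acc, codes, np.asarray(flags, dtype=bool))
    # Broadcast the per-group result back onto the rows.
    return acc[codes]

numpy_display = ~group_any(df["cluster_id"], df["dummy"])
pandas_display = ~df.groupby("cluster_id")["dummy"].transform("any")

print(numpy_display)
print((numpy_display == pandas_display.to_numpy()).all())
```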