Question

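This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (other than the groupby column)?

Here is some test code that doesn't work: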

#!/usr/bin/env python3

import pandas as pd
import numpy as np

df = pd.DataFrame({'a':[1,1,2,2], 
                   'b':[1,np.nan,2,np.nan],
                   'c':[1,np.nan,2,3]})

# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method

result = df.isna().groupby('a').sum()
print(result)
# result:
#          b    c
# a
# False  2.0  1.0

result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
#    a  b  c
# a
# 1  0  2  1
# 2  0  2  1

Desired output:

     b    c
a
1    1    1
2    1    0

Answer 1

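Use apply with isna and sum. We also select the relevant columns explicitly, so the unneeded a column does not show up in the result:

Note: apply can be slow, so a vectorized solution is recommended. See the answers by WenYoBen, Anky, or {{3 }}.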

df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())

Output:

   b  c
a      
1  1  1
2  1  0
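As a minimal, self-contained sketch of the same idea, the value columns can be derived from the frame instead of being hard-coded. The name value_cols is hypothetical and only used for illustration; the data is the test frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# value_cols is a hypothetical helper name: every column except the group key.
value_cols = [col for col in df.columns if col != 'a']

# For each group, count the missing values in every selected column.
result = df.groupby('a')[value_cols].apply(lambda g: g.isna().sum())
print(result)
```

This keeps the apply-based approach, so the performance caveat in the note above still applies.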

Answer 2

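It is generally best to avoid groupby.apply in favor of the basic built-in functions, which are cythonized and scale much better when there are many groups. This leads to a large performance gain. In this case, first call isnull() on the entire DataFrame, then do groupby + sum.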

df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#   b  c
#a      
#1  1  1
#2  1  0
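The one-liner above can be read as two steps. The sketch below splits it up on the question's test frame; the intermediate names mask and counts are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# Step 1: boolean frame marking missing values, with the group key left out.
mask = df[df.columns.difference(['a'])].isnull()

# Step 2: group the boolean frame by the original 'a' values (aligned on the
# row index) and sum; True counts as 1, so the sum is the number of NAs.
counts = mask.groupby(df['a']).sum().astype(int)
print(counts)
```

Passing the Series df['a'] as the grouping key is what lets the key column stay out of the boolean frame, so it is never converted to True/False.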

To illustrate the performance gain:

import pandas as pd
import numpy as np

N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})

%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
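%timeit is an IPython magic, so the comparison above does not run as a plain script. A rough sketch using the standard-library timeit module follows. The smaller N, the fixed random seed, and the function names vectorized and with_apply are assumptions made here so that the script finishes quickly; the absolute timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

N = 2_000  # much smaller than the N = 50000 above, to keep the apply run short
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': [*range(N // 2)] * 2,
                   'b': rng.choice([1, np.nan], N),
                   'c': rng.choice([1, np.nan], N)})


def vectorized(frame):
    # isnull() on the whole frame, then groupby + sum
    cols = frame.columns.difference(['a'])
    return frame[cols].isnull().groupby(frame['a']).sum().astype(int)


def with_apply(frame):
    # one Python-level call per group
    return frame.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())


# Both approaches should produce the same counts.
same = np.array_equal(vectorized(df).to_numpy(), with_apply(df).to_numpy())

t_vec = timeit.timeit(lambda: vectorized(df), number=3)
t_apply = timeit.timeit(lambda: with_apply(df), number=3)
print(f"same result: {same}")
print(f"vectorized: {t_vec:.4f} s, apply: {t_apply:.4f} s (3 runs each)")
```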

Answer 3

Another way is to call set_index() on a, then groupby on the index and sum:

df.set_index('a').isna().groupby(level=0).sum()*1

Or:

df.set_index('a').isna().groupby(level=0).sum().astype(int)

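Or without groupby, as suggested by @WenYoBen:

# Note: the level argument of sum() was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use one of the groupby(level=0) forms above.
df.set_index('a').isna().sum(level=0).astype(int)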
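A short sketch of why moving a into the index matters, using the question's test frame. If isna() is applied while a is still a regular column, a itself becomes all False, which is what produced the single False group in the question's third attempt. The names key_flags, flags and counts are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# With 'a' as a regular column, isna() turns the key into booleans.
key_flags = df.isna()['a'].unique()
print(key_flags)

# With 'a' in the index, isna() only touches b and c and the key survives.
flags = df.set_index('a').isna()
counts = flags.groupby(level=0).sum().astype(int)
print(counts)
```

Here level=0 refers to the first (and only) index level, which holds the original a values.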

Answer 4

I would do count first and then subtract it from value_counts. The reason I did not use apply is that it usually performs poorly.

df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]: 
   b  c
1  1  1
2  1  0
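The arithmetic behind this is "rows per group minus non-null values per group". A minimal sketch on the question's test frame is shown below; it swaps in groupby(...).size() for value_counts(dropna=False) to get the group sizes, which is a substitution made here for illustration rather than the answer's exact code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# Non-null values per group and column.
non_null = df.groupby('a')[['b', 'c']].count()

# Number of rows in each group.
group_sizes = df.groupby('a').size()

# rsub computes group_sizes - non_null, aligning the Series on the row index.
na_counts = non_null.rsub(group_sizes, axis=0)
print(na_counts)
```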

Alternative:

df.isna().drop(columns='a').astype(int).groupby(df['a']).sum()
Out[83]: 
   b  c
a      
1  1  1
2  1  0

Answer 5

Your question already contains the answer (you mistyped _df as df inside the lambda):

result = df.groupby('a')[['b', 'c']].apply(lambda _df: _df.isna().sum())
result
   b  c
a      
1  1  1
2  1  0

Answer 6

You need to drop the a column after using apply.

df.groupby('a').apply(lambda x: x.isna().sum()).drop(columns='a')

Output:

   b  c
a      
1  1  1
2  1  0

Answer 7

Another quick-and-dirty way:

df.set_index('a').isna().astype(int).groupby(level=0).sum()

Answer 8

You can write your own aggregation function, as follows:

df.groupby('a').agg(lambda x: x.isna().sum())

Result:

     b    c
a          
1  1.0  1.0
2  1.0  0.0
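The counts above come back as floats. If integer counts are wanted, as in the desired output, one possible sketch is to cast the aggregated result explicitly; the exact dtype returned before the cast may depend on the pandas version, so the cast is applied unconditionally here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})

# Custom aggregation: count missing values per column within each group,
# then cast the counts to integers.
result = df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)
print(result)
```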

Pandas: count NAs with a groupby for all columns