How to count a specific value across multiple columns in pandas

Time: 2020-07-26 16:54:01

Tags: python pandas

I have a DataFrame

import pandas as pd

df = pd.DataFrame({
    'colA':['?',2,3,4,'?'],
    'colB':[1,2,'?',3,4],
    'colC':['?',2,3,4,5]
})

I want to count the number of '?' values in each column and get the following output:

colA - 2
colB - 1
colC - 1

Is there a way to get this output in one step? Right now the only approach I know is to write a for loop over each column.

5 answers:

Answer 0 (score: 8)

The simplest way seems to be

df[df == '?'].count()

The result is

colA    2
colB    1
colC    1
dtype: int64

where df[df == '?'] gives us a DataFrame that contains only ? and NaN

  colA colB colC
0    ?  NaN    ?
1  NaN  NaN  NaN
2  NaN    ?  NaN
3  NaN  NaN  NaN
4    ?  NaN  NaN

and count then counts the non-NA cells in each column.
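
The steps described above can be written out one at a time. This is only a minimal sketch using the DataFrame from the question; the intermediate names mask, masked and counts are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Step 1: element-wise comparison gives a boolean DataFrame.
mask = df == '?'

# Step 2: indexing with the boolean frame keeps the matching cells
# and replaces every other cell with NaN.
masked = df[mask]

# Step 3: count() tallies the non-NA cells in each column.
counts = masked.count()
print(counts)  # colA 2, colB 1, colC 1
```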

Please also look at the other solutions: one is more readable and one is faster.

Answer 1 (score: 4)

You can use numpy.count_nonzero here

pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)

colA    2
colB    1
colC    1
dtype: int64
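
For readers who want to see what each part of that one-liner does, here is a self-contained sketch. It assumes the DataFrame from the question; the names arr, matches and per_column are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# The columns mix ints and strings, so to_numpy() returns an
# object-dtype 2-D array.
arr = df.to_numpy()

# Element-wise comparison: True where a cell equals '?'.
matches = arr == '?'

# axis=0 counts the True values down each column.
per_column = np.count_nonzero(matches, axis=0)

# Re-attach the column labels, since NumPy dropped them.
result = pd.Series(per_column, index=df.columns)
print(result)  # colA 2, colB 1, colC 1
```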

Timeit results:

Benchmarked with the df provided in the question.

In [172]: %timeit pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
86.1 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

#Ezer K's answer
In [168]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
158 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# YOBEN_S's answer
In [169]: %timeit df.eq('?').sum()
298 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# Bear Brown's answer
In [165]: %timeit df[df == '?'].count()
1.43 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmarking with a df of shape (1_000_000, 3)

big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)

In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
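
The %timeit lines above are IPython magics. Outside IPython, a rough comparison could be scripted with the standard-library timeit module. The sketch below is not the benchmark used above: the dictionary keys, the smaller repeat factor (2_000 instead of 200_000) and the loop count are assumptions chosen to keep it quick, and absolute timings will vary by machine and library version.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Smaller than the 1_000_000-row frame above so the script runs quickly.
big_df = pd.DataFrame(df.to_numpy().repeat(2_000, axis=0), columns=df.columns)

candidates = {
    'count_nonzero': lambda: pd.Series(
        np.count_nonzero(big_df.to_numpy() == '?', axis=0), index=big_df.columns),
    'eq_sum': lambda: big_df.eq('?').sum(),
    'mask_count': lambda: big_df[big_df == '?'].count(),
    'counter': lambda: pd.Series(
        np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.to_numpy()),
        index=big_df.columns),
}

loops = 5  # arbitrary; raise it for steadier numbers
timings = {}
for name, fn in candidates.items():
    # Sanity check: every approach must return the same counts.
    assert fn().tolist() == [4000, 2000, 2000]
    timings[name] = timeit.timeit(fn, number=loops) / loops
    print(f"{name}: {timings[name] * 1e3:.2f} ms per call")
```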

Answer 2 (score: 3)

We can use sum

df.eq('?').sum()
Out[182]: 
colA    2
colB    1
colC    1
dtype: int64
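
This works because eq returns a boolean DataFrame and sum treats True as 1 and False as 0. As a small sketch, the same idea can be wrapped in a function for any target value; count_value is a hypothetical helper name, not a pandas API.

```python
import pandas as pd

def count_value(frame, value):
    """Hypothetical helper: per-column count of cells equal to `value`."""
    # eq() gives True/False per cell; sum() adds the True values per column.
    return frame.eq(value).sum()

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

question_marks = count_value(df, '?')
twos = count_value(df, 2)
print(question_marks)  # colA 2, colB 1, colC 1
print(twos)            # colA 1, colB 1, colC 1
```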

Answer 3 (score: 1)

@Bear Brown's answer is probably the most elegant; a faster option is to use numpy:

import numpy as np
import pandas as pd
from collections import Counter

%%timeit
df[df == '?'].count()

5.2 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)

218 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
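
To make the mechanics of the Counter approach visible: np.apply_along_axis with axis 0 hands each column to the function as a 1-D array, and Counter tallies every distinct value in it. The sketch below assumes the DataFrame from the question; first_column and its tally are shown only for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# What the lambda sees for the first column: a 1-D object array.
first_column = df.to_numpy()[:, 0]
tally = Counter(first_column)
print(tally['?'])  # 2

# Applying the same tally to every column (axis 0) and re-labelling.
counts = pd.Series(
    np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.to_numpy()),
    index=df.columns,
)
print(counts)  # colA 2, colB 1, colC 1
```

Counter builds a full tally of every value in the column, so it does more work than is needed when only one value is of interest.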

Answer 4 (score: 0)

A variation on BENY's answer:

(df=='?').sum()
Out[182]: 
colA    2
colB    1
colC    1
dtype: int64