I have a DataFrame:
import pandas as pd
df = pd.DataFrame({
'colA':['?',2,3,4,'?'],
'colB':[1,2,'?',3,4],
'colC':['?',2,3,4,5]
})
I want to count the number of '?' values in each column and return the following output:
colA - 2
colB - 1
colC - 1
Is there a way to return this output in one step? Right now, the only way I know is to write a for loop over each column.
Answer 0: (score: 8)
A simple way seems to be
df[df == '?'].count()
The result is
colA 2
colB 1
colC 1
dtype: int64
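Here df[df == '?'] gives us a DataFrame that keeps every ? and replaces all other values with NaN:
  colA colB colC
0    ?  NaN    ?
1  NaN  NaN  NaN
2  NaN    ?  NaN
3  NaN  NaN  NaN
4    ?  NaN  NaN
and count then returns the number of non-NA cells in each column.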
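To make the two steps above easier to follow, here is a minimal sketch that splits the one-liner into its parts. The names mask, masked and counts are only illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Step 1: element-wise comparison gives a boolean DataFrame.
mask = df == '?'

# Step 2: boolean indexing keeps matching cells and turns the rest into NaN.
masked = df[mask]

# Step 3: count() returns the number of non-NA cells per column.
counts = masked.count()
print(counts)
```

Since count() only looks at non-NA cells, this approach is not suitable when the value being searched for is NaN itself.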
Please also look at the other solutions: one is more readable and another is faster.
Answer 1: (score: 4)
You can use numpy.count_nonzero here.
pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
# pd.Series((df.values == '?').sum(0), index=df.columns)
colA 2
colB 1
colC 1
dtype: int64
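As a rough illustration of what the one-liner above does, the sketch below separates the comparison from the counting. The variable names arr, matches and counts are mine, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Mixed str/int columns become an object-dtype array of shape (5, 3).
arr = df.to_numpy()

# Element-wise comparison yields a boolean array of the same shape.
matches = arr == '?'

# Count the True values down each column (axis=0), then restore the labels.
counts = pd.Series(np.count_nonzero(matches, axis=0), index=df.columns)
print(counts)
```

The work happens on a plain NumPy array, so the column labels are lost along the way and have to be reattached through index=df.columns.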
Timeit results:
Benchmark using the df given in the question.
In [172]: %timeit pd.Series(np.count_nonzero(df.to_numpy()=='?', axis=0), index=df.columns)
86.1 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Ezer K's answer
In [168]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
158 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# YOBEN_S's answer
In [169]: %timeit df.eq('?').sum()
298 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Bear Brown's answer
In [165]: %timeit df[df == '?'].count()
1.43 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark with a df of shape (1_000_000, 3):
big_df = pd.DataFrame(df.to_numpy().repeat(200_000,axis=0))
big_df.shape
(1000000, 3)
In [186]: %timeit pd.Series(np.count_nonzero(big_df.to_numpy()=='?', axis=0), index=big_df.columns)
53.1 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [187]: %timeit big_df.eq('?').sum()
171 ms ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [188]: %timeit big_df[big_df == '?'].count()
314 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [189]: %timeit pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, big_df.values), index=big_df.columns)
174 ms ± 3.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
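The timings above come from an IPython session. If you want to repeat the comparison in a plain script, something like the sketch below might work. It assumes the standard timeit module instead of %timeit, uses a smaller frame (5,000 rows) so it runs quickly, and passes columns=df.columns so the labels are kept, which the benchmark above does not do. The candidates dict and its keys are my own naming. It also checks that all four approaches agree before timing them; absolute numbers will vary by machine and library version.

```python
from collections import Counter
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Repeat every row 1,000 times: 5 rows become 5,000 rows.
big_df = pd.DataFrame(df.to_numpy().repeat(1_000, axis=0), columns=df.columns)

# The four approaches from this thread, wrapped as functions of a DataFrame.
candidates = {
    'count_nonzero': lambda d: pd.Series(
        np.count_nonzero(d.to_numpy() == '?', axis=0), index=d.columns),
    'Counter': lambda d: pd.Series(
        np.apply_along_axis(lambda x: Counter(x)['?'], 0, d.to_numpy()),
        index=d.columns),
    'eq().sum()': lambda d: d.eq('?').sum(),
    'mask.count()': lambda d: d[d == '?'].count(),
}

expected = {'colA': 2000, 'colB': 1000, 'colC': 1000}

for name, func in candidates.items():
    # Verify correctness first, then time the call.
    assert func(big_df).to_dict() == expected, name
    best = min(timeit.repeat(lambda: func(big_df), number=5, repeat=3)) / 5
    print(f'{name}: {best * 1e3:.3f} ms per call')
```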
Answer 2: (score: 3)
We can use sum
df.eq('?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64
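This works because eq returns a boolean DataFrame and sum treats True as 1 and False as 0. A small sketch, with the extra step of printing the counts in the "colA - 2" form the question asks for (the names counts and lines are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'colA': ['?', 2, 3, 4, '?'],
    'colB': [1, 2, '?', 3, 4],
    'colC': ['?', 2, 3, 4, 5],
})

# Boolean DataFrame: True where the cell equals '?'.
is_question_mark = df.eq('?')

# Summing booleans counts the True values in each column.
counts = is_question_mark.sum()

# Format each (column, count) pair the way the question shows it.
lines = [f'{col} - {n}' for col, n in counts.items()]
print('\n'.join(lines))
```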
Answer 3: (score: 1)
@Bear Brown's answer is probably the most elegant; a faster option is to use numpy:
from collections import Counter
%%timeit
df[df == '?'].count()
5.2 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
pd.Series(np.apply_along_axis(lambda x: Counter(x)['?'], 0, df.values), index=df.columns)
218 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer 4: (score: 0)
A variation on BENY's answer:
(df=='?').sum()
Out[182]:
colA 2
colB 1
colC 1
dtype: int64