Question

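I am looking for an efficient way to compute pivot tables and frequency counts, with one requirement: if I know the domain of a variable, the counts should cover every value in that domain, not only the values observed in the sample.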

For example, with the code below, the Series.value_counts method outputs:

2    2
1    2

But I know the domain of my variable is [0, 1, 2], so what I really want is:

0    0
1    2
2    2

Here is a code sample that reproduces the example:

import pandas as pd
import numpy as np

s=pd.Series([1,2,2,1])

def my_value_counts(s,levels):
    # levels is a NumPy array holding every value in the variable's domain
    c=s.value_counts()
    foundl=sorted(c.index)
    counts=np.zeros_like(levels)
    for i,l in enumerate(levels):
        if l in foundl:
            counts[i]=c.loc[l]
    return counts

print("Original method")
print(s.value_counts())
print("with all levels")
print(my_value_counts(s, np.arange(3)))

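My question is: is my code inefficient? It looks like some sorting might help. If so, is there a way to do this without rebuilding the frequency table and matching its values against the output of value_counts, as I do in my code?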

Thanks, AL

Answer 1

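One way is to reindex the value_counts result with a new index that runs from 0 through the maximum value (np.arange excludes its stop, hence s.max()+1), then fill the missing entries with 0: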

In [12]:
s=pd.Series([1,2,2,1])
val = s.value_counts()
val.reindex(np.arange(0, s.max()+1)).fillna(0)

Out[12]:
0    0
1    2
2    2
dtype: float64
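
A small variation on the above, offered as a sketch rather than as part of the original answer: reindex accepts a fill_value argument, so the missing levels can be set to 0 in one step and the counts stay integers instead of becoming float64. The levels array here is an assumption standing in for the known domain; using it directly also covers the case where the largest domain value never appears in the sample, which s.max() would miss.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 1])
levels = np.arange(3)  # assumed known domain of the variable: 0, 1, 2

# Reindex onto the full domain; levels absent from the sample get a count of 0.
counts = s.value_counts().reindex(levels, fill_value=0)
print(counts)
```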

Answer 2

In [80]: pd.Series([1,2,2,1]).value_counts().reindex(np.arange(3))
Out[80]: 
0   NaN
1     2
2     2
dtype: float64

In [81]: pd.Series([1,2,2,1]).value_counts().reindex(np.arange(3)).fillna(0)
Out[81]: 
0    0
1    2
2    2
dtype: float64
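
For the special case in this example, where the domain is exactly the non-negative integers 0..n-1, NumPy's np.bincount with minlength is another option. This is a different technique from the reindex shown above, not something the answer proposed, and the name n_levels is just an illustrative assumption for the size of the domain.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 2, 1])
n_levels = 3  # assumed domain size: values 0, 1, 2

# bincount counts occurrences of each non-negative integer;
# minlength pads the result with zeros up to the full domain.
counts = pd.Series(np.bincount(s.to_numpy(), minlength=n_levels),
                   index=np.arange(n_levels))
print(counts)
```

This only applies to non-negative integer data; for arbitrary labels the reindex approach is the more general one.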

Answer 3

Efficient? Probably. Elegant? Not so much.

s.value_counts().combine_first(pd.Series(np.zeros(3)))
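
If elegance is the concern, one alternative (a sketch of a different technique, not what this answer uses) is to declare the known domain as the categories of a pandas Categorical. value_counts on categorical data reports every category, including those with no observations. The levels list is an assumption standing in for the known domain.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 1])
levels = [0, 1, 2]  # assumed known domain of the variable

# Encode the domain as categories; unobserved categories are counted as 0.
cat = pd.Series(pd.Categorical(s, categories=levels))
counts = cat.value_counts(sort=False)
print(counts)
```

The same categorical column can also be used in grouped or pivoted summaries, where the declared categories likewise define the set of levels.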

value_counts with known variable levels
