Question

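I need to process very large 1-D arrays efficiently to extract some statistics from each bin, and I have found the binned_statistic function from scipy.stats very useful because its 'statistic' argument offers built-in options that run efficiently.

I would like to perform a 'count', but without counting zero values.

I am working in parallel with sliding windows (pandas rolling functions) over the same arrays, and there replacing zeros with NaN works fine, but that behaviour does not carry over to my case.

Here is a toy example of what I am doing: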

import numpy as np
import pandas as pd
from scipy.stats import binned_statistic

# Example with sliding windows: this returns just the length of each window:
a = np.array([1., 0., 0., 1.])
pd.Series(a).rolling(2).count() # Returns [1.,2.,2.,2.]

# The count can be restricted to non-zero values by replacing zeros with NaN:
nonzero_a = a.copy()
nonzero_a[nonzero_a == 0.0] = np.nan
pd.Series(nonzero_a).rolling(2).count()   # Returns [1.,1.,0.,1.]

# However, with binned_statistic I am not able to do anything similar:
binned_statistic(range(4), a, bins=2, statistic='count')[0]
binned_statistic(range(4), nonzero_a, bins=2, statistic='count')[0]
binned_statistic(range(4), np.array([1., False, None, 1.]), bins=2, statistic='count')[0]

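All of the calls above give the same output, [2., 2.], but I expected [1., 1.].

The only option I have found is to pass a custom function, but on real cases it performs much worse than the built-in statistics.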

binned_statistic(range(4), a, bins=2, statistic=np.count_nonzero)
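
To make the gap concrete, here is a rough sketch of how it could be measured. The array size, the number of bins, the random data and the `timed` helper are arbitrary assumptions for illustration, not my real data, and the timings will vary with the machine and the SciPy version.

```python
import time

import numpy as np
from scipy.stats import binned_statistic

# Hypothetical test data: size and sparsity are arbitrary choices.
rng = np.random.default_rng(0)
values = rng.integers(0, 3, size=100_000).astype(float)  # roughly one third zeros
x = np.arange(values.size)
n_bins = 200


def timed(statistic):
    """Run binned_statistic once; return (per-bin result, elapsed seconds)."""
    start = time.perf_counter()
    result = binned_statistic(x, values, bins=n_bins, statistic=statistic)[0]
    return result, time.perf_counter() - start


builtin_counts, t_builtin = timed('count')         # built-in: counts zeros too
custom_counts, t_custom = timed(np.count_nonzero)  # custom callable applied per bin

print(f"built-in 'count': {t_builtin:.4f} s")
print(f"np.count_nonzero: {t_custom:.4f} s")
```

Only the second call gives the counts I want; the first is included purely as a speed reference for the built-in statistics.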

Answer 1

I found a simple way to replicate a non-zero count: convert the array to 0s and 1s and apply the built-in 'sum':

 # Transform all non-zero values to 1s (!= also catches negative values)
 a = np.array([1., 0., 0., 2.])
 nonzero_a = a.copy()
 nonzero_a[nonzero_a != 0.0] = 1.0     # nonzero_a = [1., 0., 0., 1.]

 # bins=2 as in the question's example
 binned_statistic(np.arange(len(nonzero_a)), nonzero_a, bins=2, statistic='sum')[0]   # Returns [1.0, 1.0]
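
The same idea can be written with a boolean mask instead of an in-place assignment, and cross-checked against the custom-function version from the question. This is only a sketch on the toy data: the names `x`, `indicator`, `by_sum`, `by_callable` and `by_hist` are mine, and the `np.histogram` call with `weights` is an alternative I am adding for comparison, not part of the original approach.

```python
import numpy as np
from scipy.stats import binned_statistic

a = np.array([1., 0., 0., 2.])
x = np.arange(len(a))

# 1.0 where the value is non-zero, 0.0 elsewhere.
indicator = (a != 0).astype(float)

# Summing the indicator per bin gives the per-bin non-zero count.
by_sum = binned_statistic(x, indicator, bins=2, statistic='sum')[0]

# Reference: the slower custom-function approach from the question.
by_callable = binned_statistic(x, a, bins=2, statistic=np.count_nonzero)[0]

# Plain NumPy alternative: a weighted histogram over the same two bins.
by_hist, _ = np.histogram(x, bins=2, weights=indicator)

print(by_sum, by_callable, by_hist)
```

On this toy input all three give one non-zero value per bin. Because the indicator array contains no NaN, this avoids the question of how `binned_statistic` treats missing values.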
