Question

Suppose I have a 2D array of positive integers:

a = numpy.array([[1, 1, 2],
                 [1, 2, 5],
                 [1, 3, 6],
                 [3, 3, 3],
                 [3, 4, 6],
                 [4, 5, 6],
                ])

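and a threshold (a positive integer). For each row, I want to count how many values are < threshold, how many are >= threshold and < threshold+2, and how many are >= threshold+2. The result should be stored in an array of shape n x 3, where n = a.shape[0] and each of the 3 columns corresponds to one threshold partition.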

For the example above and threshold = 3, the result would be:

b = numpy.array([[3, 0, 0],
                 [2, 0, 1],
                 [1, 1, 1],
                 [0, 3, 0],
                 [0, 2, 1],
                 [0, 1, 2],
                ])

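My solution uses a for loop combined with masks, so that I can apply the masks to each row separately. But using a for loop over an array feels wrong. Is there a more efficient way to achieve this?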

My solution so far:

b = []
for row in a:
    b.append((numpy.sum(row < threshold),
              numpy.sum((row >= threshold) * (row < threshold + 2)),
              numpy.sum(row >= threshold + 2)))
b = numpy.array(b)

Answer 1

Approach #1

Use elementwise comparisons against the thresholds and sum along each row:

import numpy as np

t = 3 # threshold
mask0 = (a<t)
mask2 = a>=t+2
mask1 = (a>=t) & ~mask2
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]
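
As a sanity check, the three masks above can be compared against the row-by-row loop from the question. This is a minimal sketch that rebuilds the sample array; the name expected is introduced here for illustration only.

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3  # threshold

# Three mutually exclusive boolean masks, one per partition.
mask0 = a < t
mask2 = a >= t + 2
mask1 = (a >= t) & ~mask2

# Every element must fall in exactly one partition.
assert np.all(mask0.astype(int) + mask1 + mask2 == 1)

# Summing a boolean mask along axis 1 counts the True values per row.
out = np.c_[mask0.sum(1), mask1.sum(1), mask2.sum(1)]

# Reference result computed with the loop from the question.
expected = np.array([[np.sum(row < t),
                      np.sum((row >= t) & (row < t + 2)),
                      np.sum(row >= t + 2)] for row in a])
assert np.array_equal(out, expected)
print(out)
```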

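Approach #2

If you think about it, we are creating three bins here. So we can get the bin ID of each element and then count, per row, how often each ID occurs. We'll use np.searchsorted to get the bin IDs, then compare elementwise against each ID and sum along each row.

That gives a solution like this: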

t = 3 # threshold
bins = [t, t+2]   # Create intervals
N = len(bins)+1   # Number of cols in output
idx = np.searchsorted(bins,a,'right') # Get bin IDs
out = np.column_stack([(idx==i).sum(1) for i in range(N)])
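
To make the intermediate step concrete, here is a small sketch that prints the bin IDs computed for the sample array from the question. Each value is replaced by 0, 1 or 2, depending on which interval it falls into.

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3
bins = [t, t + 2]  # interval edges: [3, 5]

# side='right' sends values equal to an edge into the bin to its right,
# so 3 gets ID 1 and 5 gets ID 2.
idx = np.searchsorted(bins, a, side='right')
print(idx)
# [[0 0 0]
#  [0 0 2]
#  [0 1 2]
#  [1 1 1]
#  [1 1 2]
#  [1 2 2]]

# Counting how often each ID occurs per row gives the final table.
out = np.column_stack([(idx == i).sum(1) for i in range(len(bins) + 1)])
print(out)
```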

We can vectorize the last step using broadcasting:

out = (idx == np.arange(N)[:,None,None]).sum(2).T
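
The shapes involved in that one-liner are easy to lose track of, so the sketch below spells them out on the sample array. The names ids, cmp and alt are introduced here for illustration, and alt is an equivalent axis layout rather than something taken from the answer.

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3
bins = [t, t + 2]
N = len(bins) + 1
idx = np.searchsorted(bins, a, side='right')  # shape (6, 3)

ids = np.arange(N)[:, None, None]             # shape (3, 1, 1)
cmp = idx == ids                              # broadcasts to (3, 6, 3): (bin, row, column)
out = cmp.sum(2).T                            # count per (bin, row), then transpose to (6, 3)

# Same result with the bin axis placed last instead of first.
alt = (idx[:, :, None] == np.arange(N)).sum(1)
assert np.array_equal(out, alt)
print(cmp.shape, out.shape)
```

Note that the comparison materializes a boolean array with N times as many elements as a, which is the memory cost of this variant.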

Another vectorized alternative, which is also more memory-efficient, uses np.bincount:

M = a.shape[0]
r = N*np.arange(M)[:,None]
out = np.bincount((idx + r).ravel(),minlength=M*N).reshape(M,N)
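
The trick here is the per-row offset r: adding N times the row number to each bin ID gives every (row, bin) pair its own integer, so a single flat np.bincount counts all of them at once. A small sketch on the sample array, with the intermediate name shifted added for illustration:

```python
import numpy as np

a = np.array([[1, 1, 2],
              [1, 2, 5],
              [1, 3, 6],
              [3, 3, 3],
              [3, 4, 6],
              [4, 5, 6]])
t = 3
bins = [t, t + 2]
N = len(bins) + 1
idx = np.searchsorted(bins, a, side='right')

M = a.shape[0]
r = N * np.arange(M)[:, None]   # row offsets 0, 3, 6, ... as a column vector
shifted = idx + r               # row i now uses IDs in the range [N*i, N*i + N)
print(shifted)
# [[ 0  0  0]
#  [ 3  3  5]
#  [ 6  7  8]
#  [10 10 10]
#  [13 13 14]
#  [16 17 17]]

# minlength guarantees M*N counts even if the highest IDs never occur.
out = np.bincount(shifted.ravel(), minlength=M * N).reshape(M, N)
print(out)
```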

Answer 2

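You have break points at 3 and 5. We can use np.searchsorted to find where each element of a falls relative to those break points.

np.searchsorted([3, 5], 1, side='right') returns 0 because 1 must be inserted at position 0 to keep the array sorted.
np.searchsorted([3, 5], 3, side='right') returns 1. A 3 could be inserted on either side of the existing 3 (position 0 or position 1) and the array would stay sorted. The default behavior is to insert to the left of equal elements; side='right' changes that to insert to the right of all equal elements. This is what implements the condition < threshold: a value equal to the threshold lands in the second bin.
np.searchsorted([3, 5], 5) returns 1 with the default side='left'; with side='right', as used in the solution, it returns 2, so 5 lands in the last bin.
np.searchsorted([3, 5], 7) returns 2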
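
The difference between the two side settings can be seen by running the four values listed above through np.searchsorted in one call. This is a minimal sketch; the names edges, values, left and right are chosen here for illustration.

```python
import numpy as np

edges = [3, 5]
values = np.array([1, 3, 5, 7])

# Default side='left': a value equal to an edge is inserted before it.
left = np.searchsorted(edges, values)
# side='right': a value equal to an edge is inserted after it.
right = np.searchsorted(edges, values, side='right')

print(left)   # [0 0 1 2]
print(right)  # [0 1 2 2]
```

Only the values that coincide with an edge (3 and 5) change bins between the two settings.
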
I use np.eye to build sub-arrays that can be summed to count how many elements fall in each bin.

np.eye(3, dtype=int)[np.searchsorted([3, 5], a, side='right')].sum(1)

array([[3, 0, 0],
       [2, 0, 1],
       [1, 1, 1],
       [0, 3, 0],
       [0, 2, 1],
       [0, 1, 2]])
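
To see why indexing into np.eye works, it helps to look at a single row. Indexing the identity matrix with bin IDs returns one one-hot row per element, and summing those rows gives the per-bin counts. A small sketch using the row [3, 4, 6] from the sample array:

```python
import numpy as np

eye = np.eye(3, dtype=int)
row = np.array([3, 4, 6])

ids = np.searchsorted([3, 5], row, side='right')
print(ids)      # [1 1 2]

# Each bin ID selects the matching row of the identity matrix.
one_hot = eye[ids]
print(one_hot)
# [[0 1 0]
#  [0 1 0]
#  [0 0 1]]

# Summing the one-hot rows counts the elements per bin.
counts = one_hot.sum(0)
print(counts)   # [0 2 1]
```

For the full 2D array the indexed result has shape (rows, columns, bins), which is why the one-liner sums over axis 1.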

We can generalize this with a function

def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)

count_bins(a, 3, [2])

array([[3, 0, 0],
       [2, 0, 1],
       [1, 1, 1],
       [0, 3, 0],
       [0, 2, 1],
       [0, 1, 2]])

or

count_bins(a, 3, [1, 1])

array([[3, 0, 0, 0],
       [2, 0, 0, 1],
       [1, 1, 0, 1],
       [0, 3, 0, 0],
       [0, 1, 1, 1],
       [0, 0, 1, 2]])
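
One way to gain confidence in the generalized function is to compare it with a plain loop on random input. The sketch below assumes a hypothetical reference helper, count_bins_loop, written only for this check; the random test data and seed are likewise arbitrary choices.

```python
import numpy as np

def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    return eye[edges.searchsorted(a, side='right')].sum(1)

def count_bins_loop(a, threshold, interval_sizes):
    # Hypothetical reference: count each half-open interval [lo, hi) explicitly.
    edges = np.append(threshold, interval_sizes).cumsum()
    lows = np.append(-np.inf, edges)
    highs = np.append(edges, np.inf)
    return np.array([[np.sum((row >= lo) & (row < hi))
                      for lo, hi in zip(lows, highs)]
                     for row in a])

rng = np.random.default_rng(0)
data = rng.integers(1, 10, size=(5, 4))  # arbitrary positive integers

fast = count_bins(data, 3, [1, 1])
slow = count_bins_loop(data, 3, [1, 1])

assert np.array_equal(fast, slow)
# Each row's counts must add up to the number of columns.
assert np.all(fast.sum(1) == data.shape[1])
```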

But I'd rather return a pandas DataFrame, which makes the result easier to read

import pandas as pd

def count_bins(a, threshold, interval_sizes):
    edges = np.append(threshold, interval_sizes).cumsum()
    eye = np.eye(edges.size + 1, dtype=int)
    labels = ['{:0.0f} to {:0.0f}'.format(i, j) for i, j in zip(np.append(-np.inf, edges), np.append(edges, np.inf))]
    return pd.DataFrame(
        eye[edges.searchsorted(a, side='right')].sum(1),
        columns=labels
    )

count_bins(a, 3, [2])

   -inf to 3  3 to 5  5 to inf
0          3       0         0
1          2       0         1
2          1       1         1
3          0       3         0
4          0       2         1
5          0       1         2

How to apply multiple masks to an array and count occurrences per row
