Question

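I want to find the two most common values along the last axis of a numpy array, together with how often each of them occurs. I already have a working solution, but I would like to make it run faster.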

Example case

The actual data is a numpy array of shape (720, 1280, 64) and dtype uint16, but for simplicity let's imagine it is a (2, 2, 4) array.

So the data would look like this:

               0          1  
          ------------------------
        0 | [1,1,1,2] [1,1,2,2]
        1 | [2,2,2,1] [1,1,1,3]

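For each x, y position I want to know the most common value and the second most common value, and how many times each of them occurs (if two values are tied, picking either one is fine).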

So for the example above, the most common values are:

               0          1  
          ------------------------
        0 |    1          1
        1 |    2          1

How many times they occur:

               0          1  
          ------------------------
        0 |    3          2
        1 |    3          3

The second most common values (if there is no second most common value, putting a zero there is fine):

               0          1  
          ------------------------
        0 |    2          2
        1 |    1          3

How many times the second most common values occur. If there is no second most common value, anything can go here.

               0          1  
          ------------------------
        0 |    1          2
        1 |    1          1
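
A slow per-cell reference can help when checking a faster implementation against the four tables above. The following is only a sketch: the name top2_reference is made up for illustration, and it relies on collections.Counter.most_common listing tied values in the order they were first encountered.

```python
from collections import Counter

import numpy as np


def top2_reference(a):
    """Slow per-cell reference (hypothetical helper): top two values and
    their counts along the last axis of a 3D array."""
    h, w = a.shape[:2]
    # Rows of `out`: value 1, count 1, value 2, count 2.
    out = np.zeros((4, h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            ranked = Counter(a[y, x].tolist()).most_common(2)
            # Pad with (0, 0) when the cell holds only one distinct value.
            ranked += [(0, 0)] * (2 - len(ranked))
            (v1, c1), (v2, c2) = ranked
            out[:, y, x] = v1, c1, v2, c2
    return out


a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
])
value1, count1, value2, count2 = top2_reference(a)
print(value1, count1, value2, count2, sep="\n\n")
```

This is far too slow for the full (720, 1280, 64) array, but it is easy to reason about and is meant only for comparing results on small inputs.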

Current solution

If the array is called "a", I first do this to get the most common values and how many times they occur:

import numpy as np
from scipy.stats import mode

a = np.array([
    [[1,1,1,2], [1,1,2,2]],
    [[2,2,2,1], [1,1,1,3]]
])

most_common_value, most_common_count = mode(a, axis=2)
print(most_common_value.squeeze())
print(most_common_count.squeeze())

Output:

[[1 1]
 [2 1]]

[[3 2]
 [3 3]]

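Then, to get the second most common values, I simply remove the most common values and run the code above again. To remove them, I first create a mask of the values to remove.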

mask = a == most_common_value
print(mask)

Output:

[[[ True  True  True False]
  [ True  True False False]]

 [[ True  True  True False]
  [ True  True  True False]]]
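
The comparison that builds the mask only broadcasts if the mode array still has a trailing axis of length 1. Depending on the SciPy version and its keepdims setting, mode may return the values with that axis already dropped. A minimal sketch of restoring the axis with plain numpy, assuming the mode values arrive with shape (2, 2):

```python
import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
])

# Assumed input: mode values with the last axis already dropped, shape (2, 2).
most_common_value = np.array([[1, 1], [2, 1]])

# Re-add a trailing axis of length 1 so that the comparison broadcasts
# each cell's mode across all 4 samples of that cell.
mask = a == most_common_value[..., np.newaxis]

print(mask.shape)         # (2, 2, 4)
print(mask.sum(axis=-1))  # number of True entries per cell
```

The per-cell sum of the mask equals the count of the most common value, which is a quick sanity check that the mask lines up with the data.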

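What I would really like to do now is delete everything that is True, but since the size along the axis has to stay the same, instead of actually deleting anything I replace the most common values with NaN.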

Because these are uint16 values, which cannot represent NaN, I have to convert to float first.

a = a.astype('float')
np.putmask(a, mask, np.nan)
print(a)

Output:

[[[nan nan nan  2.]
  [nan nan  2.  2.]]

 [[nan nan nan  1.]
  [nan nan nan  3.]]]

Now mode can be run on it again, except that it needs to be told to ignore NaN, and the result needs to be converted back to uint16.

m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m
print(second_most_common_value.squeeze())
print(second_most_common_count.squeeze())

Output:

[[2 2]
 [1 3]]

[[1 2]
 [1 1]]

At this point I have all the most common and second most common values, and how many times they occur along the axis, so I am done.

Performance

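As I mentioned, this solution works but it is slow. Here is the same code as above, but as a script with realistically sized data that you can try running. I also put it up on pastebin for easy copying.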

Standalone example:

import time
import numpy as np
from scipy.stats import mode

a = np.random.randint(30000, size=(720, 1280, 64))

start_time = time.time()

most_common_value, most_common_count = mode(a, axis=2)

mask = a == most_common_value
a = a.astype('float')
np.putmask(a, mask, np.nan)

m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m

end_time = time.time()
print(f'Took {end_time-start_time:.2f} seconds to run')

Output:

Took 123.29 seconds to run

Ideally this would run in under 30 seconds, but any improvement is welcome.

Why do this?

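As you may have noticed, the first two dimensions of (720, 1280, 64) correspond to a 1280x720 image resolution. The 64 values per pixel are the colors of the subpixels under that pixel, stored as indices into a known palette.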

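To know how to color each pixel, I need to know its two most common palette colors so that I can blend them. The data comes from Blender, from a scene I created, so I know that each pixel almost always contains only two distinct palette colors.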

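The point of the project is to improve the rendering quality on my website, where users can create custom animations on the fly; solving this problem would get rid of the jagged edges in the renders.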

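Since my animation has 600 frames and each frame takes about two minutes, a full run takes about a day. I would like to be able to start it when I go to bed and have the finished result in the morning, so I want to improve the performance.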

Answer 1

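I ended up writing a simple mode function that iterates over all the values along the final axis and checks whether each one should become the new mode. For my data, this naive solution still ends up about twice as fast as scipy.stats.mode.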

import numpy as np

def silly_mode(a):
    """Returns mode and counts for final axis of a numpy array.

    Same as scipy.stats.mode(a, axis=-1).squeeze()
    """

    # Best mode candidate discovered so far
    most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)

    # Silly solution based on just iterating all of final dimension,
    # but still beats scipy if final dimension is less than 100 in length.
    for i in range(0, a.shape[2]):

        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis = -1)

        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis = -1).astype(a.dtype)

        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]

        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)

    return most_common_value, most_common_count
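
To make the loop body concrete, here is a small sketch of what one iteration computes on the (2, 2, 4) example from the question. It uses slicing with i:i + 1 in place of np.expand_dims, which gives the same (2, 2, 1) shape; the variable names simply mirror the function above.

```python
import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
])

for i in (0, 3):
    # Candidate value per cell, kept 3D so it broadcasts against `a`.
    val = a[:, :, i:i + 1]

    # For each cell, count how many of its 4 entries equal the candidate.
    counts = np.count_nonzero(a == val, axis=-1)

    print(f"i={i} candidates:\n{val[..., 0]}")
    print(f"i={i} counts:\n{counts}")
```

For i=0 the candidates are [[1, 1], [2, 1]] with counts [[3, 2], [3, 3]], and for i=3 the candidates are [[2, 2], [1, 3]] with counts [[1, 2], [1, 1]]. Each iteration does one full comparison over the array, so the total work grows with the square of the last axis length, which is consistent with the comment that this only beats scipy for short final dimensions.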

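This also has the advantage that it can be extended to find the second most common value, which I think will be much faster than my approach of taking the mode with scipy, removing the mode values, and then taking the mode again.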

Once I have that working, I will update this answer with a way to find the second most common value.

Update:

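Here is a function that finds the top 2 most common values and their counts. I would not rely on it for anything critical, since it has not been properly tested beyond a few test cases.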

def top_2_most_common_values(a, ignore_zeros = False):
    """Returns mode and counts for each mode of final axis of a numpy array,
    and also returns the second most common value and its counts.

    Similar to calling scipy.stats.mode(a, axis=-1).squeeze() to find the mode,
    except this also returns the second most common values.

    If ignore_zeros is true, then zero will not be considered as a mode candidate.
    In this case a zero instead signifies that there was no most common or second
    most common value, and so the count will also be zero.
    """

    # Silly solution based on just iterating all of final dimension
    most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    second_most_common_value = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)
    second_most_common_count = np.zeros((a.shape[0], a.shape[1]),dtype=a.dtype)

    for i in range(0, a.shape[2]):

        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis = -1)

        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis = -1).astype(a.dtype)

        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]
        if ignore_zeros:
            update_mask &= val.squeeze() != 0

        # If the most common value changes, then what used to be most common is now second most common.
        # Without the next two lines a case like [1,2,2] would fail, as the second most common value
        # is never encountered again after being initially set to be the most common one.
        np.putmask(second_most_common_value, update_mask, most_common_value)
        np.putmask(second_most_common_count, update_mask, most_common_count)

        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)

        # In a case like [2, 0, 0, 1] the last 1 isn't the new most common value, but it 
        # still should be updated as the second most common value. For these cases separately check 
        # if any encountered value might be the second most common one.
        update_mask = (counts >= second_most_common_count[:,:]) & (val.squeeze() != most_common_value[:,:])
        if ignore_zeros:
            update_mask &= val.squeeze() != 0

        # Update second most common value and its count where necessary
        np.putmask(second_most_common_value, update_mask, val)
        np.putmask(second_most_common_count, update_mask, counts)

    return most_common_value, most_common_count, second_most_common_value, second_most_common_count
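
The two-stage update is easier to follow on a single pixel. The sketch below restates the same rules for one plain Python list; the name top2_scalar is made up for illustration and it is not meant as a replacement for the vectorized function.

```python
def top2_scalar(values, ignore_zeros=False):
    """Scalar restatement (hypothetical helper) of the update rules
    used by top_2_most_common_values, for a single pixel."""
    best_val = best_cnt = second_val = second_cnt = 0
    for val in values:
        if ignore_zeros and val == 0:
            continue  # zero is never a candidate in this mode
        cnt = values.count(val)
        if cnt > best_cnt:
            # The old mode is demoted to second place before being overwritten.
            second_val, second_cnt = best_val, best_cnt
            best_val, best_cnt = val, cnt
        if cnt >= second_cnt and val != best_val:
            # A value that is not the mode can still take second place.
            second_val, second_cnt = val, cnt
    return best_val, best_cnt, second_val, second_cnt


print(top2_scalar([1, 2, 2]))     # (2, 2, 1, 1)
print(top2_scalar([2, 0, 0, 1]))  # (0, 2, 1, 1)
print(top2_scalar([2, 0, 0, 1], ignore_zeros=True))  # (2, 1, 1, 1)
```

In [1, 2, 2] the value 1 reaches second place only through the demotion step, while in [2, 0, 0, 1] the trailing 1 reaches it only through the separate second-place check, which is why the function needs both.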
