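I want to find the two most common values along the last axis of a numpy array, together with how often each occurs. I already have a working solution, but I would like to make it run faster.
The real data is a uint16 numpy array of shape (720, 1280, 64), but for simplicity let's imagine it is a (2, 2, 4) array.
So the data would look like this:
        0          1
    ------------------------
0 | [1,1,1,2]  [1,1,2,2]
1 | [2,2,2,1]  [1,1,1,3]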
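For each x, y position I want to know the most common value and the second most common value, and how many times each of them occurs (if two values are tied, picking either one is fine).
So for the example above, the most common values are:
      0   1
    ---------
0 |   1   1
1 |   2   1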
How many times they occur:
      0   1
    ---------
0 |   3   2
1 |   3   3
The second most common values (if there is no second most common value, putting a zero there is fine):
      0   1
    ---------
0 |   2   2
1 |   1   3
How often the second most common values occur. If there is no second most common value, anything can go here.
      0   1
    ---------
0 |   1   2
1 |   1   1
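The four tables above can be cross-checked with a slow brute-force reference. This is only a minimal sketch for small test arrays, not a candidate for the real data; the function name `top2_bruteforce` is made up for illustration, and ties are resolved by whichever value appears first along the axis.

```python
from collections import Counter

import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
])


def top2_bruteforce(arr):
    """Hypothetical reference: loop over every (y, x) cell in pure Python."""
    h, w, _ = arr.shape
    # Rows of `out`: first value, first count, second value, second count
    out = np.zeros((4, h, w), dtype=arr.dtype)
    for y in range(h):
        for x in range(w):
            ranked = Counter(arr[y, x].tolist()).most_common(2)
            out[0, y, x], out[1, y, x] = ranked[0]
            if len(ranked) > 1:  # otherwise leave zeros as the placeholder
                out[2, y, x], out[3, y, x] = ranked[1]
    return out


first_val, first_cnt, second_val, second_cnt = top2_bruteforce(a)
print(first_val, first_cnt, second_val, second_cnt, sep="\n")
```

A reference like this is mainly useful for validating a faster implementation on small random inputs.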
If the array is called "a", I first do this to get the most common values and how many times they occur:
import numpy as np
from scipy.stats import mode
a = np.array([
    [[1,1,1,2], [1,1,2,2]],
    [[2,2,2,1], [1,1,1,3]]
])
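# keepdims=True (SciPy >= 1.9) keeps the reduced axis so the result broadcasts against a
most_common_value, most_common_count = mode(a, axis=2, keepdims=True)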
print(most_common_value.squeeze())
print(most_common_count.squeeze())
Output:
[[1 1]
 [2 1]]
[[3 2]
 [3 3]]
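Then, to get the second most common value, I simply remove the most common value and run the same thing again. To remove it, I first create a mask of the values to remove.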
mask = a == most_common_value
print(mask)
Output:
array([[[ True,  True,  True, False],
        [ True,  True, False, False]],
       [[ True,  True,  True, False],
        [ True,  True,  True, False]]])
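What I really want to do now is delete everything that is True, but since the size along the axis has to stay the same, I replace the most common values with NaN instead of actually deleting anything.
Because these are uint16 values, which cannot represent NaN, I have to convert to float first.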
a = a.astype('float')
np.putmask(a, mask, np.nan)
print(a)
Output:
[[[nan nan nan  2.]
  [nan nan  2.  2.]]
 [[nan nan nan  1.]
  [nan nan nan  3.]]]
Now mode can be run on it again; it just needs to be told to ignore NaN, and the result needs to be converted back to uint16.
m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m
print(second_most_common_value.squeeze())
print(second_most_common_count.squeeze())
Output:
[[2 2]
 [1 3]]
[[1 2]
 [1 1]]
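At this point I have the most common and second most common values, and how many times each occurs along the axis, so I am done.
As I mentioned, this solution works but is slow. Here is the above again, but as a script with realistically sized data that you can try running. I also put it up on pastebin for easy copying.
Standalone example: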
import time
import numpy as np
from scipy.stats import mode
a = np.random.randint(30000, size=(720, 1280, 64))
start_time = time.time()
# keepdims=True (SciPy >= 1.9) keeps the reduced axis so the mask below broadcasts
most_common_value, most_common_count = mode(a, axis=2, keepdims=True)
mask = a == most_common_value
a = a.astype('float')
np.putmask(a, mask, np.nan)
m = mode(a, axis=2, nan_policy='omit')
m = [x.astype('uint16') for x in m]
second_most_common_value, second_most_common_count = m
end_time = time.time()
print(f'Took {end_time-start_time:.2f} seconds to run')
Output:
Took 123.29 seconds to run
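Ideally this would run in under 30 seconds, but any improvement is welcome.
As you may have noticed, the first two dimensions of (720, 1280, 64) correspond to a 1280x720 image resolution. The 64 values per pixel are the colors of the subpixels under that pixel, stored as indices into a known palette.
To know how to color each pixel, I need the two most common palette colors so that I can blend them. The data comes from Blender, from a scene I created, so I know that each pixel almost always contains only two different palette colors.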
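The post does not say how the two colors are blended. As one plausible reading, here is a minimal sketch that weights each palette color by its subpixel count. The `palette` array, its RGB values, and the count-weighted average are all assumptions made for illustration; the input arrays are the expected results from the small example.

```python
import numpy as np

# Hypothetical 4-entry RGB palette, indexed by palette color number
palette = np.array([
    [0, 0, 0],
    [255, 0, 0],
    [0, 255, 0],
    [0, 0, 255],
], dtype=np.float64)

# Top-2 values and counts for the (2, 2, 4) example
first_val = np.array([[1, 1], [2, 1]])
first_cnt = np.array([[3, 2], [3, 3]])
second_val = np.array([[2, 2], [1, 3]])
second_cnt = np.array([[1, 2], [1, 1]])

# Assumed blend: weight of the most common color is its share of the two counts
w1 = (first_cnt / (first_cnt + second_cnt))[..., None]   # shape (2, 2, 1)
blended = w1 * palette[first_val] + (1.0 - w1) * palette[second_val]
blended = np.rint(blended).astype(np.uint8)              # shape (2, 2, 3)
print(blended)
```

If a pixel has no second value, its second count is zero under the convention above, so the weight becomes 1 and the pixel keeps the most common color.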
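The point of this project is to improve the rendering quality on my website, where users can create custom animations on the fly; solving this would get rid of the jagged edges in the renders.
Since my animation has 600 frames, at about two minutes per frame the whole run takes roughly a day. I would like to be able to start it when I go to sleep and have the finished result in the morning, so I want to improve the performance somewhat.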
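Answer 0 (score: 0)
I ended up writing a simple mode function that iterates over all the values of the final axis and checks, for each one, whether it is the new mode. For my data, this naive solution still ended up being about twice as fast as scipy.stats.mode.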
import numpy as np

def silly_mode(a):
    """Returns mode and counts for final axis of a numpy array.

    Same as scipy.stats.mode(a, axis=-1).squeeze()
    """
    # Best mode candidate discovered so far
    most_common_value = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    # Silly solution based on just iterating all of final dimension,
    # but still beats scipy if final dimension is less than 100 in length.
    for i in range(0, a.shape[2]):
        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis=-1)
        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis=-1).astype(a.dtype)
        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]
        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)
    return most_common_value, most_common_count
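To make the counting step inside the loop concrete, here is that single step in isolation on the (2, 2, 4) example from the question. The helper name `count_candidate` is made up for this sketch; it only wraps the same two numpy calls the loop uses.

```python
import numpy as np

a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
])


def count_candidate(arr, i):
    """Hypothetical helper: how often arr[y, x, i] occurs in arr[y, x, :]."""
    # Keep a trailing axis of length 1 so the comparison broadcasts
    # the candidate against all values of its own cell
    val = np.expand_dims(arr[:, :, i], axis=-1)      # shape (2, 2, 1)
    return np.count_nonzero(arr == val, axis=-1)     # shape (2, 2)


counts_first = count_candidate(a, 0)   # candidates 1, 1, 2, 1
counts_last = count_candidate(a, 3)    # candidates 2, 2, 1, 3
print(counts_first)
print(counts_last)
```

Each pass compares the whole array against one slice, so the loop does as many full-array comparisons as the final axis is long (64 for the real data).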
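This also has the advantage that it can be extended to find the second most common value, which I expect to be much faster than my scipy approach of taking the mode, removing the mode values, and then taking the mode again.
Once I have that working, I will update this answer with a way to find the second most common value.
Update:
Here is a function that finds the top 2 most common values and their counts. I would not rely on it for anything critical, because it has not been properly tested beyond a few test cases.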
def top_2_most_common_values(a, ignore_zeros = False):
    """Returns mode and counts for each mode of final axis of a numpy array,
    and also returns the second most common value and its counts.

    Similar to calling scipy.stats.mode(a, axis=-1).squeeze() to find the mode,
    except this also returns the second most common values.

    If ignore_zeros is true, then zero will not be considered as a mode candidate.
    In this case a zero instead signifies that there was no most common or second
    most common value, and so the count will also be zero.
    """
    # Silly solution based on just iterating all of final dimension
    most_common_value = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    most_common_count = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    second_most_common_value = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    second_most_common_count = np.zeros((a.shape[0], a.shape[1]), dtype=a.dtype)
    for i in range(0, a.shape[2]):
        # Find candidate value for each cell
        val = np.expand_dims(a[:,:,i], axis=-1)
        # Count how many times it appears
        counts = np.count_nonzero(a == val, axis=-1).astype(a.dtype)
        # Find out which ones should become the new mode values
        update_mask = counts > most_common_count[:,:]
        if ignore_zeros:
            update_mask &= val.squeeze() != 0
        # If most common value changes, then what used to be most common is now second most common.
        # Without the next two lines a case like [1,2,2] would fail, as the second most common value
        # is never encountered again after being initially set to be the most common one.
        np.putmask(second_most_common_value, update_mask, most_common_value)
        np.putmask(second_most_common_count, update_mask, most_common_count)
        # Update mode value and its count where necessary
        np.putmask(most_common_value, update_mask, val)
        np.putmask(most_common_count, update_mask, counts)
        # In a case like [2, 0, 0, 1] the last 1 isn't the new most common value, but it
        # still should be updated as the second most common value. For these cases separately check
        # if any encountered value might be the second most common one.
        update_mask = (counts >= second_most_common_count[:,:]) & (val.squeeze() != most_common_value[:,:])
        if ignore_zeros:
            update_mask &= val.squeeze() != 0
        # Update second most common value and its count where necessary
        np.putmask(second_most_common_value, update_mask, val)
        np.putmask(second_most_common_count, update_mask, counts)
    return most_common_value, most_common_count, second_most_common_value, second_most_common_count
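For comparison, here is a sketch of a different, sort-based technique. It is not the loop above and not something the answer proposes: it sorts the final axis once and reads the counts off the lengths of runs of equal values. The name `top2_by_sorting` is made up, ties go to the smaller value, and whether it is actually faster on the real (720, 1280, 64) data would have to be timed; it also allocates several temporaries the size of the input.

```python
import numpy as np


def top2_by_sorting(a):
    """Hypothetical sketch: top-2 values and counts along the last axis."""
    s = np.sort(a, axis=-1)
    idx = np.arange(s.shape[-1])

    # True where a new run of equal values starts
    run_start = np.ones(s.shape, dtype=bool)
    run_start[..., 1:] = s[..., 1:] != s[..., :-1]
    # True at the last element of each run
    run_end = np.ones(s.shape, dtype=bool)
    run_end[..., :-1] = run_start[..., 1:]

    # For every position, the index at which its run started
    start_idx = np.maximum.accumulate(np.where(run_start, idx, 0), axis=-1)
    # Run length stored at the last element of each run, zero elsewhere
    counts = np.where(run_end, idx - start_idx + 1, 0)

    def pick(c):
        # Position of the longest run, then its value and length
        k = np.expand_dims(c.argmax(axis=-1), axis=-1)
        value = np.take_along_axis(s, k, axis=-1)[..., 0]
        count = np.take_along_axis(c, k, axis=-1)[..., 0]
        return value, count, k

    v1, c1, k1 = pick(counts)
    rest = counts.copy()
    np.put_along_axis(rest, k1, 0, axis=-1)   # hide the winning run
    v2, c2, _ = pick(rest)
    v2 = np.where(c2 > 0, v2, 0)              # zero when there is no second value
    return v1, c1, v2, c2


a = np.array([
    [[1, 1, 1, 2], [1, 1, 2, 2]],
    [[2, 2, 2, 1], [1, 1, 1, 3]],
], dtype=np.uint16)
v1, c1, v2, c2 = top2_by_sorting(a)
print(v1, c1, v2, c2, sep="\n")
```

Unlike the loop, this does not support the ignore_zeros option; that would need the zero runs to be masked out of `counts` before picking.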