Question

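Three lines of code that add a mask of shape [height, width, 1] to each of the R, G and B channels (each also of shape [height, width, 1]) slow this code down from under a second to 5-10 minutes.

Is there a better way to add two numpy matrices? I know the slowdown comes from the addition, because when I take it out the code runs fast again. Any insight into why it is so slow?

This is an RGB color Mat. It is divided into small regions called superpixels, which are just groupings of pixels. I am trying to take the average of all the pixels in each grouping and fill that grouping with that single color. My first attempt did this perfectly for a picture in under a second. However, all the blacks were taken out. To fix this, I decided to add 1 wherever the superpixel mask is set but the pixel is still zero, so that black pixels are counted in the average.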

import cv2 as cv
import os
import numpy as np
img = cv.imread(path+item)
f, e = os.path.splitext(path+item)

for x in range(0, 3):
    img = cv.pyrDown(img)

height, width, channel = img.shape

img_super = cv.ximgproc.createSuperpixelSLIC(img, cv.ximgproc.MSLIC, 100, 10.0)
img_super.iterate(3)

labels = np.zeros((height, width), np.int32)
labels = img_super.getLabels(labels)

super_pixelized = np.zeros_like(img)

print("Check-1")

for x in range(0, img_super.getNumberOfSuperpixels()):
    new_img = img.copy()
    #print(new_img.shape)

    mask = cv.inRange(labels, x, x)

    new_img = cv.bitwise_and(img, new_img, mask=mask)

    r, g, b = np.dsplit(new_img, 3)

    print("Check-2")
    basis = np.expand_dims(mask, 1)

    basis = basis * 1/255

    print(basis)

    r = np.add(r, basis)
    g = np.add(g, basis)
    b = np.add(b, basis)

    r_avg = int(np.sum(r)/np.count_nonzero(r))
    g_avg = int(np.sum(g)/np.count_nonzero(g))
    b_avg = int(np.sum(b)/np.count_nonzero(b))

    #print(r_avg)
    #print(g_avg)
    #print(b_avg)

    r = mask * r_avg
    g = mask * g_avg
    b = mask * b_avg

    final_img = np.dstack((r, g, b))

    super_pixelized = cv.bitwise_or(final_img, super_pixelized)

This simple addition causes the running time of the code to increase dramatically.

Answer 1

Fixing the slowdown

The specific problem that slows the program down is the call to np.expand_dims(...):

basis = np.expand_dims(mask, 1)

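The second parameter is the "position in the expanded axes where the new axis is placed". Since mask has 2 axes at that point, you are inserting the new axis between the first and the second.

For example: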

>>> import numpy as np
>>> mask = np.zeros((240, 320), np.uint8)
>>> mask.shape
(240L, 320L)
>>> expanded = np.expand_dims(mask, 1)
>>> expanded.shape
(240L, 1L, 320L)

We get an array of shape (240L, 1L, 320L), where we really wanted (240L, 320L, 1L).

Later, you add this misshapen array to each of the split channel images, which have shape (240L, 320L, 1L).

>>> img = np.zeros((240,320,3), np.uint8)
>>> r, g, b = np.dsplit(img, 3)
>>> r.shape
(240L, 320L, 1L)
>>> r = np.add(r, expanded)
>>> r.shape
(240L, 320L, 320L)

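Because of the way numpy broadcasting rules work, you end up with a 320-channel image (instead of a 1-channel one).

There are orders of magnitude more values to process in the subsequent steps, hence the drastic slowdown.

The fix is simple: just add the axis in the right place.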

basis = np.expand_dims(mask, 2)

This fixes the slowdown. However, there are more issues to address, and there is room for optimization.
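
A minimal sketch to check the shapes involved, assuming a dummy 240x320 mask and a single zero-filled channel (both made up for illustration). It also shows mask[:, :, np.newaxis], an equivalent way to append the axis:

```python
import numpy as np

# Dummy stand-ins for the real mask and one split channel.
mask = np.zeros((240, 320), np.uint8)
channel = np.zeros((240, 320, 1), np.uint8)

wrong = np.expand_dims(mask, 1)       # new axis in the middle: (240, 1, 320)
right = np.expand_dims(mask, 2)       # new axis at the end:    (240, 320, 1)
also_right = mask[:, :, np.newaxis]   # same shape as `right`

bad_sum = np.add(channel, wrong)      # broadcasts up to (240, 320, 320)
good_sum = np.add(channel, right)     # stays at (240, 320, 1)
```

The bad sum holds 320 times as many values as the good one, which is where the extra run time goes.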

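Preparing for optimization

Since we are interested in the performance of the code that processes the labels, let's refactor a little and build a simple test harness that contains all the common bits, along with a way to time the individual steps and report the timings:

File superpix_harness.py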

import cv2
import time

def run_test(superpix_size, fn, file_name_in, reduce_count, file_name_out):
    times = []
    times.append(time.time())

    img = cv2.imread(file_name_in)
    for _ in range(0, reduce_count):
        img = cv2.pyrDown(img)
    times.append(time.time())

    img_super = cv2.ximgproc.createSuperpixelSLIC(img, cv2.ximgproc.MSLIC, superpix_size, 10.0)
    img_super.iterate(3)
    labels = img_super.getLabels()
    superpixel_count = img_super.getNumberOfSuperpixels()    
    times.append(time.time())

    super_pixelized = fn(img, labels, superpixel_count)
    times.append(time.time())   

    cv2.imwrite(file_name_out, super_pixelized)
    times.append(time.time())

    return (img.shape, superpix_size, superpixel_count, times)

def print_header():
    print "Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total"

def print_report(test_result):
    shape, sp_size, sp_count, times = test_result
    print "%d , %d , %d , %d" % (shape[0], shape[1], sp_size, sp_count),
    for i in range(len(times) - 1):
        print (", %0.4f" % (times[i+1] - times[i])),
    print ", %0.4f" % (times[-1] - times[0])

def measure_fn(fn):
    print_header()
    for reduction in [3,2,1,0]:
        for sp_size in [100,50,25,12]:
            print_report(run_test(sp_size, fn, 'barrack.jpg', reduction, 'output_%01d_%03d.jpg' % (reduction, sp_size)))

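As it happens, the first sufficiently large image I had at hand for testing was barrack.jpg.

Baseline

OK, let's refactor the processing code into a standalone function, and clean it up a bit in the process.

First, note that since we are working in Python, we are not talking about Mat but about numpy.ndarray. Another thing to keep in mind is that OpenCV uses the BGR color format by default, so the variables should be renamed accordingly.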
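
As a small illustration of the channel order (a numpy-only sketch on a made-up 1x2 image, so it does not need OpenCV): index 0 of the last axis is blue, and reversing that axis yields RGB.

```python
import numpy as np

# Hypothetical 1x2 image in OpenCV's BGR layout: one pure-blue and one pure-red pixel.
img_bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Channel planes, in the order OpenCV stores them.
b, g, r = img_bgr[:, :, 0], img_bgr[:, :, 1], img_bgr[:, :, 2]

# Reversing the last axis gives the same pixels in RGB order.
img_rgb = img_bgr[:, :, ::-1]
```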

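The initial copy of the input image made with new_img = img.copy() is useless, since you overwrite it soon enough. Let's drop it and just do new_img = cv.bitwise_and(img, img, mask=mask).

Now, we first need to understand what got you into this predicament. After masking out the region for a specific label, you calculate the mean intensity as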

b_avg = int(np.sum(b) / np.count_nonzero(b))

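You have identified the problem correctly: while the count of non-zero pixels correctly discounts anything that does not belong to the current label, it also discounts any zero-valued pixels that do belong to the label (and thus throws off the resulting mean).

There is a much simpler fix than the one you tried: just divide by the number of non-zero pixels in mask (and reuse this count in all 3 calculations).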
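
A minimal sketch of the difference, on a made-up 2x3 single-channel image whose label region contains two black pixels (the arrays and names are illustrative only):

```python
import numpy as np

# Hypothetical channel; the label covers the left 2x2 block, which contains two zeros.
channel = np.array([[0, 10, 99],
                    [0, 30, 99]], dtype=np.uint8)
mask = np.array([[255, 255, 0],
                 [255, 255, 0]], dtype=np.uint8)

# What masking leaves behind: pixels outside the label become 0.
masked = np.where(mask != 0, channel, 0)

# Dividing by the non-zero count of the channel ignores the black pixels of the label.
biased_mean = masked.sum() / float(np.count_nonzero(masked))

# Dividing by the non-zero count of the mask counts every pixel of the label.
correct_mean = masked.sum() / float(np.count_nonzero(mask))
```

Here the biased mean is 40 / 2 and the correct one is 40 / 4.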

Finally, we can take advantage of numpy indexing to write the channel's mean color only to the masked pixels (e.g. b[mask != 0] = b_avg).
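
A short sketch of that boolean-mask assignment on a made-up 3x4 channel (the value 200 stands in for a computed mean):

```python
import numpy as np

channel = np.arange(12, dtype=np.uint8).reshape(3, 4)

# Hypothetical mask selecting the bottom-right 2x2 block.
mask = np.zeros((3, 4), np.uint8)
mask[1:, 2:] = 255

# Only positions where the mask is non-zero are overwritten.
channel[mask != 0] = 200
```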

File op_labels.py

import cv2
import numpy as np

def super_pixelize(img, labels, superpixel_count):
    result = np.zeros_like(img)

    for x in range(superpixel_count):
        mask = cv2.inRange(labels, x, x)
        new_img = cv2.bitwise_and(img, img, mask=mask)

        # OpenCV images are BGR, so the planes come out in blue, green, red order
        b, g, r = np.dsplit(new_img, 3)

        # Count the pixels of this label once and reuse the count for all channels
        label_pixel_count = np.count_nonzero(mask)
        b_avg = np.uint8(np.sum(b) / label_pixel_count)
        g_avg = np.uint8(np.sum(g) / label_pixel_count)
        r_avg = np.uint8(np.sum(r) / label_pixel_count)

        b[mask != 0] = b_avg
        g[mask != 0] = g_avg
        r[mask != 0] = r_avg

        final_img = np.dstack((b, g, r))

        result = cv2.bitwise_or(final_img, result)

    return result

Now we can measure the performance of the code.

Benchmark script:

from superpix_harness import *

import op_labels    
measure_fn(op_labels.super_pixelize)

Timings:

Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 ,  420 ,  336 , 100 ,  155 , 0.1490 , 0.0590 , 0.3990 , 0.0070 , 0.6140
3 ,  420 ,  336 ,  50 ,  568 , 0.1490 , 0.0670 , 1.4510 , 0.0070 , 1.6740
3 ,  420 ,  336 ,  25 , 1415 , 0.1480 , 0.0720 , 3.6580 , 0.0080 , 3.8860
3 ,  420 ,  336 ,  12 , 3009 , 0.1490 , 0.0860 , 7.7170 , 0.0070 , 7.9590
2 ,  839 ,  672 , 100 ,  617 , 0.1460 , 0.3570 , 7.1140 , 0.0150 , 7.6320
2 ,  839 ,  672 ,  50 , 1732 , 0.1460 , 0.3590 , 20.0610 , 0.0150 , 20.5810
2 ,  839 ,  672 ,  25 , 3556 , 0.1520 , 0.3860 , 40.8780 , 0.0160 , 41.4320
2 ,  839 ,  672 ,  12 , 6627 , 0.1460 , 0.3990 , 76.1310 , 0.0160 , 76.6920
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 2.2480 , 88.3880 , 0.0460 , 90.8250
1 , 1678 , 1344 ,  50 , 4519 , 0.1430 , 2.2440 , 221.7200 , 0.0580 , 224.1650
1 , 1678 , 1344 ,  25 , 9083 , 0.1530 , 2.2100 , 442.7040 , 0.0480 , 445.1150
1 , 1678 , 1344 ,  12 , 17869 , 0.1440 , 2.2620 , 849.9970 , 0.0500 , 852.4530
0 , 3356 , 2687 , 100 , 4786 , 0.1300 , 10.9440 , 916.8950 , 0.1570 , 928.1260
0 , 3356 , 2687 ,  50 , 11942 , 0.1280 , 10.7100 , 2284.5040 , 0.1680 , 2295.5100
0 , 3356 , 2687 ,  25 , 29066 , 0.1300 , 10.7440 , 5561.0440 , 0.1690 , 5572.0870
0 , 3356 , 2687 ,  12 , 59634 , 0.1250 , 10.9860 , 11409.9540 , 0.1770 , 11421.2420

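While the excessive run time is no longer an issue at small image sizes (and relatively low label counts), it is clear that this scales poorly. We should be able to do better.

Optimizing the baseline

First of all, we can avoid the need to split the image into single-channel images, process them, and then reassemble them into BGR format. Fortunately, OpenCV provides the function cv2.mean, which calculates the per-channel mean of an (optionally masked) image.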
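
For intuition, the masked per-channel mean can be written in plain numpy as below. This is only a sketch of the quantity being computed, on a made-up 2x2 image; it is not how cv2.mean is implemented, and cv2.mean returns four values rather than three.

```python
import numpy as np

# Hypothetical 2x2 BGR image and a mask selecting the top row.
img = np.array([[[10, 20, 30], [30, 40, 50]],
                [[200, 200, 200], [0, 0, 0]]], dtype=np.uint8)
mask = np.array([[255, 255],
                 [0, 0]], dtype=np.uint8)

# img[mask != 0] has shape (N, 3): one row per selected pixel.
masked_mean = img[mask != 0].mean(axis=0)   # per-channel mean over the selected pixels
```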

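Another useful optimization is to pre-allocate the mask array and reuse it in subsequent iterations (cv2.inRange has an optional parameter that lets you give it an output array to reuse). Allocations (and deallocations) can be quite costly, so the fewer you do, the better.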
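
The same reuse pattern exists in plain numpy through the out= argument of ufuncs. A minimal sketch on a made-up label array (np.equal stands in for cv2.inRange here):

```python
import numpy as np

labels = np.array([[0, 0, 1],
                   [2, 1, 1]], dtype=np.int32)

# Allocate the mask buffer once...
mask = np.empty(labels.shape, dtype=bool)

pixel_counts = []
for label in range(3):
    # ...and let every iteration overwrite it in place instead of allocating a new array.
    np.equal(labels, label, out=mask)
    pixel_counts.append(int(np.count_nonzero(mask)))
```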

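The most important observation is that each superpixel (the region with a specific label) is generally much smaller than the whole image. Rather than processing the whole image for each label, we should restrict most of the work to a region of interest (ROI): the smallest rectangle that fits all the pixels belonging to a specific label.

To determine the ROI, we can use cv2.boundingRect on the mask.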
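
To make the idea concrete, here is a numpy-only sketch that returns the same kind of (x, y, w, h) tuple. The helper bounding_rect is hypothetical and assumes the mask has at least one non-zero pixel; it is an analogue for illustration, not OpenCV's implementation.

```python
import numpy as np

def bounding_rect(mask):
    """Return (x, y, w, h) of the smallest rectangle holding all non-zero mask pixels."""
    rows = np.flatnonzero(mask.any(axis=1))   # indices of rows that contain the label
    cols = np.flatnonzero(mask.any(axis=0))   # indices of columns that contain the label
    x, y = cols[0], rows[0]
    return int(x), int(y), int(cols[-1] - x + 1), int(rows[-1] - y + 1)

# Made-up 6x8 mask with a 3-row by 4-column block set.
mask = np.zeros((6, 8), np.uint8)
mask[2:5, 3:7] = 255

x, y, w, h = bounding_rect(mask)
mask_roi = mask[y:y+h, x:x+w]   # the slice that later steps work on
```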

File improved_labels.py

import cv2
import numpy as np

def super_pixelize(img, labels, superpixel_count):
    result = np.zeros_like(img)

    mask = np.zeros(img.shape[:2], np.uint8) # Here it seems to make more sense to pre-alloc and reuse
    for label in range(0, superpixel_count):
        cv2.inRange(labels, label, label, dst=mask)

        # Find the bounding box of this label
        x,y,w,h = cv2.boundingRect(mask)

        # Work only on the rectangular region containing the label
        mask_roi = mask[y:y+h,x:x+w]
        img_roi = img[y:y+h,x:x+w]

        # Per-channel mean of the masked pixels (the image is BGR, so we drop the unused 4th value that cv2.mean returns)
        roi_mean = cv2.mean(img_roi, mask_roi)[:3]

        # Set all masked pixels in the ROI of the target image to the mean colour
        result[y:y+h,x:x+w][mask_roi != 0] = roi_mean

    return result

Benchmark script:

from superpix_harness import *

import improved_labels
measure_fn(improved_labels.super_pixelize)

Timings:

Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 ,  420 ,  336 , 100 ,  155 , 0.1500 , 0.0600 , 0.0250 , 0.0070 , 0.2420
3 ,  420 ,  336 ,  50 ,  568 , 0.1490 , 0.0670 , 0.0760 , 0.0070 , 0.2990
3 ,  420 ,  336 ,  25 , 1415 , 0.1480 , 0.0740 , 0.1740 , 0.0070 , 0.4030
3 ,  420 ,  336 ,  12 , 3009 , 0.1480 , 0.0860 , 0.3560 , 0.0070 , 0.5970
2 ,  839 ,  672 , 100 ,  617 , 0.1510 , 0.3720 , 0.2450 , 0.0150 , 0.7830
2 ,  839 ,  672 ,  50 , 1732 , 0.1480 , 0.3610 , 0.6450 , 0.0170 , 1.1710
2 ,  839 ,  672 ,  25 , 3556 , 0.1480 , 0.3730 , 1.2930 , 0.0160 , 1.8300
2 ,  839 ,  672 ,  12 , 6627 , 0.1480 , 0.4160 , 2.3840 , 0.0160 , 2.9640
1 , 1678 , 1344 , 100 , 1854 , 0.1420 , 2.2140 , 2.8510 , 0.0460 , 5.2530
1 , 1678 , 1344 ,  50 , 4519 , 0.1480 , 2.2280 , 6.7440 , 0.0470 , 9.1670
1 , 1678 , 1344 ,  25 , 9083 , 0.1430 , 2.1920 , 13.5850 , 0.0480 , 15.9680
1 , 1678 , 1344 ,  12 , 17869 , 0.1440 , 2.2960 , 26.3940 , 0.0490 , 28.8830
0 , 3356 , 2687 , 100 , 4786 , 0.1250 , 10.9570 , 30.8380 , 0.1570 , 42.0770
0 , 3356 , 2687 ,  50 , 11942 , 0.1310 , 10.7930 , 76.1670 , 0.1710 , 87.2620
0 , 3356 , 2687 ,  25 , 29066 , 0.1250 , 10.7480 , 184.0220 , 0.1720 , 195.0670
0 , 3356 , 2687 ,  12 , 59634 , 0.1240 , 11.0440 , 377.8910 , 0.1790 , 389.2380

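This is much better (at least it finishes), though it still becomes quite expensive with large images and large superpixel counts.

There are still options for doing better, but they require a different way of looking at the problem.

Going further

Large images and large superpixel counts still perform poorly. This is mainly because, for each superpixel, we need to process the entire label array to determine the mask, and then process the entire mask to determine the ROI. Superpixels are seldom rectangular, so further work is wasted on pixels in the ROI that do not belong to the current label (even if that work is only testing the mask).

Let's recall that each position in the input can belong to only one superpixel (label). For each label, we need to calculate the mean R, G and B intensity of all the pixels that belong to it (i.e. first determine the sum for each channel and count the pixels, then calculate the mean). We should be able to do this in a single pass over the input image and the associated label array. Once we have calculated the mean color of each label, we can use it as a lookup table in a second pass over the label array, filling the output image with the appropriate colors.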
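
A back-of-the-envelope count of array elements visited, using the largest test case from the timing tables above. This is a rough model that ignores constant factors and the ROI work, so treat the ratio as an indication only:

```python
# Largest test case from the timing tables: 3356 x 2687 pixels, 59634 superpixels.
pixels = 3356 * 2687
superpixel_count = 59634

# Mask-based approach: at least one full scan of the label array per superpixel.
per_label_visits = pixels * superpixel_count

# Two-pass approach: one pass to accumulate sums and counts, one pass to write colours.
two_pass_visits = 2 * pixels

ratio = per_label_visits / float(two_pass_visits)
print("%d vs %d element visits (ratio %.0f)" % (per_label_visits, two_pass_visits, ratio))
```

The two-pass cost does not depend on the number of superpixels at all, only on the image size.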

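In Python, we can implement this algorithm as follows:

Note: Although this is quite verbose, it is the version that performed best. I can't quite explain why, but it corresponds closely to the best-performing Cython function.

File fast_labels_python.py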

import numpy as np

def super_pixelize(img, labels, superpixel_count):
    rows = img.shape[0]
    cols = img.shape[1]

    assert img.shape[0] == labels.shape[0]
    assert img.shape[1] == labels.shape[1]
    assert img.shape[2] == 3

    sums = np.zeros((superpixel_count, 3), dtype=np.int64)
    counts = np.zeros((superpixel_count, 1), dtype=np.int64)

    for r in range(rows):
        for c in range(cols):
            label = labels[r,c]
            sums[label, 0] = (sums[label, 0] + img[r, c, 0])
            sums[label, 1] = (sums[label, 1] + img[r, c, 1])
            sums[label, 2] = (sums[label, 2] + img[r, c, 2])
            counts[label, 0] = (counts[label, 0] + 1)

    label_colors = np.uint8(sums / counts)

    result = np.zeros_like(img)    
    for r in range(rows):
        for c in range(cols):
            label = labels[r,c]
            result[r, c, 0] = label_colors[label, 0]
            result[r, c, 1] = label_colors[label, 1]
            result[r, c, 2] = label_colors[label, 2]

    return result

Benchmark script:

from superpix_harness import *

import fast_labels_python
measure_fn(fast_labels_python.super_pixelize)

Timings:

Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 ,  420 ,  336 , 100 ,  155 , 0.1530 , 0.0590 , 0.5160 , 0.0070 , 0.7350
3 ,  420 ,  336 ,  50 ,  568 , 0.1470 , 0.0680 , 0.5250 , 0.0070 , 0.7470
3 ,  420 ,  336 ,  25 , 1415 , 0.1480 , 0.0740 , 0.5140 , 0.0070 , 0.7430
3 ,  420 ,  336 ,  12 , 3009 , 0.1490 , 0.0870 , 0.5190 , 0.0070 , 0.7620
2 ,  839 ,  672 , 100 ,  617 , 0.1480 , 0.3770 , 2.0720 , 0.0150 , 2.6120
2 ,  839 ,  672 ,  50 , 1732 , 0.1490 , 0.3680 , 2.0480 , 0.0160 , 2.5810
2 ,  839 ,  672 ,  25 , 3556 , 0.1470 , 0.3730 , 2.0570 , 0.0150 , 2.5920
2 ,  839 ,  672 ,  12 , 6627 , 0.1460 , 0.4140 , 2.0530 , 0.0170 , 2.6300
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 2.2080 , 8.2970 , 0.0470 , 10.6950
1 , 1678 , 1344 ,  50 , 4519 , 0.1430 , 2.2240 , 8.2970 , 0.0480 , 10.7120
1 , 1678 , 1344 ,  25 , 9083 , 0.1430 , 2.2020 , 8.2280 , 0.0490 , 10.6220
1 , 1678 , 1344 ,  12 , 17869 , 0.1430 , 2.3010 , 8.3210 , 0.0520 , 10.8170
0 , 3356 , 2687 , 100 , 4786 , 0.1270 , 10.8630 , 33.0230 , 0.1580 , 44.1710
0 , 3356 , 2687 ,  50 , 11942 , 0.1270 , 10.7950 , 32.9230 , 0.1660 , 44.0110
0 , 3356 , 2687 ,  25 , 29066 , 0.1260 , 10.7270 , 33.3660 , 0.1790 , 44.3980
0 , 3356 , 2687 ,  12 , 59634 , 0.1270 , 11.0840 , 33.1850 , 0.1790 , 44.5750

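Being a pure Python implementation, this really suffers from interpreter overhead. At smaller image sizes and lower superpixel counts, this overhead dominates. However, we can see that for a given image size, the performance stays consistent regardless of the number of superpixels. That is a good sign. A further good sign is that at sufficiently large image sizes and superpixel counts, our more efficient algorithm begins to win.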
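
One way to trim that overhead while staying in numpy is to express both passes with np.bincount and fancy indexing. The sketch below is a swapped-in vectorized variant, not one of the implementations timed in this answer; the function name super_pixelize_vectorized is made up, and it assumes every label lies in [0, superpixel_count).

```python
import numpy as np

def super_pixelize_vectorized(img, labels, superpixel_count):
    flat_labels = labels.ravel()

    # Pass 1: pixel count and per-channel sum for every label.
    counts = np.bincount(flat_labels, minlength=superpixel_count)
    counts = np.maximum(counts, 1)   # avoid dividing by zero for labels with no pixels

    label_colors = np.zeros((superpixel_count, 3), np.uint8)
    for ch in range(3):
        sums = np.bincount(flat_labels, weights=img[:, :, ch].ravel(),
                           minlength=superpixel_count)
        label_colors[:, ch] = (sums // counts).astype(np.uint8)

    # Pass 2: use the label array as an index into the colour lookup table.
    return label_colors[labels]

# Made-up 2x2 example: label 0 is the top row, label 1 the bottom row.
img = np.array([[[10, 20, 30], [20, 40, 60]],
                [[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)
labels = np.array([[0, 0], [1, 1]], dtype=np.int32)
result = super_pixelize_vectorized(img, labels, 2)
```

Like the loop version, it truncates the mean when converting to uint8.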

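To do better, we have to avoid the interpreter overhead, which means producing some code that we can compile into a binary Python module that performs the whole operation in a single Python interpreter call.

Using Cython

Cython provides the means to translate (annotated) Python code to C, and to compile the result as a binary Python module. When done right, this can result in massive performance improvements. Cython also includes support for numpy arrays, which we can take advantage of.

NB: You will need to read the Cython tutorials and documentation, and do some experimenting, to figure out how to annotate things for the best performance (as I have done). A detailed explanation is far beyond the scope of this (already overly long) answer.

File fast_labels_cython.pyx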

# cython: infer_types=True
import numpy as np
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def super_pixelize(unsigned char[:, :, :] img, int[:, :] labels, int superpixel_count):
    cdef Py_ssize_t rows = img.shape[0]
    cdef Py_ssize_t cols = img.shape[1]

    assert img.shape[0] == labels.shape[0]
    assert img.shape[1] == labels.shape[1]
    assert img.shape[2] == 3

    sums = np.zeros((superpixel_count, 3), dtype=np.int64)
    cdef long long[:, ::1] sums_view = sums

    counts = np.zeros((superpixel_count, 1), dtype=np.int64)
    cdef long long[:, ::1] counts_view = counts

    cdef Py_ssize_t r, c
    cdef int label

    for r in range(rows):
        for c in range(cols):
            label = labels[r,c]
            sums_view[label, 0] = (sums_view[label, 0] + img[r, c, 0])
            sums_view[label, 1] = (sums_view[label, 1] + img[r, c, 1])
            sums_view[label, 2] = (sums_view[label, 2] + img[r, c, 2])
            counts_view[label, 0] = (counts_view[label, 0] + 1)

    label_colors = np.uint8(sums / counts)
    cdef unsigned char[:, ::1] label_colors_view = label_colors

    result = np.zeros_like(img)    
    cdef unsigned char[:, :, ::1] result_view = result

    for r in range(rows):
        for c in range(cols):
            label = labels[r,c]
            result_view[r, c, 0] = label_colors_view[label, 0]
            result_view[r, c, 1] = label_colors_view[label, 1]
            result_view[r, c, 2] = label_colors_view[label, 2]

    return result

Compilation:

cythonize.exe -2 -i fast_labels_cython.pyx

Benchmark script:

from superpix_harness import *

import fast_labels_cython
measure_fn(fast_labels_cython.super_pixelize)

Timings:

Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 ,  420 ,  336 , 100 ,  155 , 0.1550 , 0.0600 , 0.0010 , 0.0080 , 0.2240
3 ,  420 ,  336 ,  50 ,  568 , 0.1500 , 0.0680 , 0.0010 , 0.0070 , 0.2260
3 ,  420 ,  336 ,  25 , 1415 , 0.1480 , 0.0750 , 0.0010 , 0.0070 , 0.2310
3 ,  420 ,  336 ,  12 , 3009 , 0.1490 , 0.0880 , 0.0010 , 0.0070 , 0.2450
2 ,  839 ,  672 , 100 ,  617 , 0.1480 , 0.3580 , 0.0040 , 0.0150 , 0.5250
2 ,  839 ,  672 ,  50 , 1732 , 0.1480 , 0.3680 , 0.0050 , 0.0150 , 0.5360
2 ,  839 ,  672 ,  25 , 3556 , 0.1480 , 0.3780 , 0.0040 , 0.0170 , 0.5470
2 ,  839 ,  672 ,  12 , 6627 , 0.1470 , 0.4080 , 0.0040 , 0.0170 , 0.5760
1 , 1678 , 1344 , 100 , 1854 , 0.1440 , 2.2340 , 0.0170 , 0.0450 , 2.4400
1 , 1678 , 1344 ,  50 , 4519 , 0.1430 , 2.2450 , 0.0170 , 0.0480 , 2.4530
1 , 1678 , 1344 ,  25 , 9083 , 0.1440 , 2.2290 , 0.0170 , 0.0480 , 2.4380
1 , 1678 , 1344 ,  12 , 17869 , 0.1460 , 2.3310 , 0.0180 , 0.0500 , 2.5450
0 , 3356 , 2687 , 100 , 4786 , 0.1290 , 11.0840 , 0.0690 , 0.1560 , 11.4380
0 , 3356 , 2687 ,  50 , 11942 , 0.1330 , 10.7650 , 0.0680 , 0.1680 , 11.1340
0 , 3356 , 2687 ,  25 , 29066 , 0.1310 , 10.8120 , 0.0770 , 0.1710 , 11.1910
0 , 3356 , 2687 ,  12 , 59634 , 0.1310 , 11.1200 , 0.0790 , 0.1770 , 11.5070

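Even with the largest image and nearly 60 thousand superpixels, the processing time is under a tenth of a second (compared with a little over 3 hours for the baseline version).

Using Boost.Python

Another option is to implement the algorithm directly in a lower-level language. Due to my familiarity with it, I implemented a binary Python module in C++ using Boost.Python. The library also supports numpy arrays, so the work mostly consisted of validating the input arguments and then porting the algorithm to use raw pointers.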
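
The C++ code walks the arrays by adding byte strides to raw pointers. As a small numpy illustration of what those strides mean (dummy C-contiguous arrays assumed; real arrays may have other strides, which is why the C++ code reads them instead of hard-coding them):

```python
import numpy as np

img = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
labels = np.zeros((4, 5), dtype=np.int32)

# Strides are in bytes: (bytes per row, bytes per pixel, bytes per channel value).
img_strides = img.strides        # (15, 3, 1) for this 4x5x3 uint8 array
label_strides = labels.strides   # (20, 4) for this 4x5 int32 array

# Byte offset of the first (blue) value of the pixel at row 2, column 3.
raw = np.frombuffer(img.tobytes(), dtype=np.uint8)
row, col = 2, 3
offset = row * img_strides[0] + col * img_strides[1]
blue_value = raw[offset]
```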

File fast_labels.cpp

#define BOOST_ALL_NO_LIB
#include <boost/python.hpp>
#include <boost/python/numpy.hpp>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace bp = boost::python;

bp::numpy::ndarray super_pixelize(bp::numpy::ndarray const& image
    , bp::numpy::ndarray const& labels
    , int32_t label_count)
{
    if (image.get_dtype() != bp::numpy::dtype::get_builtin<uint8_t>()) {
        throw std::runtime_error("Invalid image dtype.");
    }
    if (image.get_nd() != 3) {
        throw std::runtime_error("Image must be a 3d ndarray.");
    }
    if (image.shape(2) != 3) {
        throw std::runtime_error("Image must have 3 channels.");
    }

    if (labels.get_dtype() != bp::numpy::dtype::get_builtin<int32_t>()) {
        throw std::runtime_error("Invalid label dtype.");
    }
    if (!((labels.get_nd() == 2) || ((labels.get_nd() == 3) && (labels.shape(2) == 1)))) {
        throw std::runtime_error("Labels must have 1 channel.");
    }

    if ((image.shape(0) != labels.shape(0)) || (image.shape(1) != labels.shape(1))) {
        throw std::runtime_error("Image and labels have incompatible shapes.");
    }

    if (label_count < 1) {
        throw std::runtime_error("Must have at least 1 label.");
    }

    bp::numpy::ndarray result(bp::numpy::zeros(image.get_nd(), image.get_shape(), image.get_dtype()));

    int32_t const ROWS(image.shape(0));
    int32_t const COLUMNS(image.shape(1));

    int32_t const ROW_STRIDE_IMAGE(image.strides(0));
    int32_t const COLUMN_STRIDE_IMAGE(image.strides(1));

    int32_t const ROW_STRIDE_LABELS(labels.strides(0));
    int32_t const COLUMN_STRIDE_LABELS(labels.strides(1));

    int32_t const ROW_STRIDE_RESULT(result.strides(0));
    int32_t const COLUMN_STRIDE_RESULT(result.strides(1));

    struct label_info
    {
        int64_t sum_b = 0;
        int64_t sum_g = 0;
        int64_t sum_r = 0;
        int64_t count = 0;
    };

    struct pixel_type
    {
        uint8_t b;
        uint8_t g;
        uint8_t r;
    };

    // Step 1: Collect data for each label
    std::vector<label_info> info(label_count);
    {
        char const* labels_row_ptr(labels.get_data());
        char const* image_row_ptr(image.get_data());
        for (int32_t row(0); row < ROWS; ++row) {
            char const* labels_col_ptr(labels_row_ptr);
            char const* image_col_ptr(image_row_ptr);
            for (int32_t col(0); col < COLUMNS; ++col) {
                int32_t label(*reinterpret_cast<int32_t const*>(labels_col_ptr));
                label_info& current_info(info[label]);

                pixel_type const& pixel(*reinterpret_cast<pixel_type const*>(image_col_ptr));
                current_info.sum_b += pixel.b;
                current_info.sum_g += pixel.g;
                current_info.sum_r += pixel.r;
                ++current_info.count;

                labels_col_ptr += COLUMN_STRIDE_LABELS;
                image_col_ptr += COLUMN_STRIDE_IMAGE;
            }
            labels_row_ptr += ROW_STRIDE_LABELS;
            image_row_ptr += ROW_STRIDE_IMAGE;
        }
    }

    // Step 2: Calculate mean color for each label
    std::vector<pixel_type> label_color(label_count);
    for (int32_t label(0); label < label_count; ++label) {
        label_info& current_info(info[label]);
        pixel_type& current_color(label_color[label]);

        current_color.b = current_info.sum_b / current_info.count;
        current_color.g = current_info.sum_g / current_info.count;
        current_color.r = current_info.sum_r / current_info.count;
    }


    // Step 3: Generate result
    {
        char const* labels_row_ptr(labels.get_data());
        char* result_row_ptr(result.get_data());
        for (int32_t row(0); row < ROWS; ++row) {
            char const* labels_col_ptr(labels_row_ptr);
            char* result_col_ptr(result_row_ptr);
            for (int32_t col(0); col < COLUMNS; ++col) {
                int32_t label(*reinterpret_cast<int32_t const*>(labels_col_ptr));
                pixel_type const& current_color(label_color[label]);

                pixel_type& pixel(*reinterpret_cast<pixel_type*>(result_col_ptr));
                pixel.b = current_color.b;
                pixel.g = current_color.g;
                pixel.r = current_color.r;

                labels_col_ptr += COLUMN_STRIDE_LABELS;
                result_col_ptr += COLUMN_STRIDE_RESULT;
            }
            labels_row_ptr += ROW_STRIDE_LABELS;
            result_row_ptr += ROW_STRIDE_RESULT;
        }
    }

    return result;
}

BOOST_PYTHON_MODULE(fast_labels)
{
    bp::numpy::initialize();

    bp::def("super_pixelize", super_pixelize);
}

Compilation:

Out of scope for this answer. I used CMake to build a DLL, then renamed it to .pyd so that Python would recognize it.

Benchmark script:

from superpix_harness import *

import fast_labels
measure_fn(fast_labels.super_pixelize)

Timings:

Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 ,  420 ,  336 , 100 ,  155 , 0.1480 , 0.0580 , 0.0010 , 0.0070 , 0.2140
3 ,  420 ,  336 ,  50 ,  568 , 0.1490 , 0.0690 , 0.0010 , 0.0070 , 0.2260
3 ,  420 ,  336 ,  25 , 1415 , 0.1510 , 0.0820 , 0.0010 , 0.0070 , 0.2410
3 ,  420 ,  336 ,  12 , 3009 , 0.1510 , 0.0970 , 0.0010 , 0.0070 , 0.2560
2 ,  839 ,  672 , 100 ,  617 , 0.1490 , 0.3750 , 0.0030 , 0.0150 , 0.5420
2 ,  839 ,  672 ,  50 , 1732 , 0.1480 , 0.7540 , 0.0020 , 0.0160 , 0.9200
2 ,  839 ,  672 ,  25 , 3556 , 0.1490 , 0.7070 , 0.0030 , 0.0160 , 0.8750
2 ,  839 ,  672 ,  12 , 6627 , 0.1590 , 0.7300 , 0.0030 , 0.0160 , 0.9080
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 3.7120 , 0.0100 , 0.0450 , 3.9100
1 , 1678 , 1344 ,  50 , 4519 , 0.1430 , 2.2510 , 0.0090 , 0.0510 , 2.4540
1 , 1678 , 1344 ,  25 , 9083 , 0.1430 , 2.2080 , 0.0100 , 0.0480 , 2.4090
1 , 1678 , 1344 ,  12 , 17869 , 0.1680 , 2.4280 , 0.0100 , 0.0500 , 2.6560
0 , 3356 , 2687 , 100 , 4786 , 0.1270 , 10.9230 , 0.0380 , 0.1580 , 11.2460
0 , 3356 , 2687 ,  50 , 11942 , 0.1300 , 10.8860 , 0.0390 , 0.1640 , 11.2190
0 , 3356 , 2687 ,  25 , 29066 , 0.1270 , 10.8080 , 0.0410 , 0.1800 , 11.1560
0 , 3356 , 2687 ,  12 , 59634 , 0.1280 , 11.1280 , 0.0410 , 0.1800 , 11.4770

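Slightly better, although since we are already two orders of magnitude faster than the code that determines the superpixel labels, there is little point in going further. With the largest image and the smallest superpixel size, the processing time has dropped from 11409.954 s in the baseline to 0.041 s, an improvement of more than 5 orders of magnitude.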
