Question

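I have a large number of image files (5M+) in a folder. The images are of different sizes, and I want to resize them all to 128x128.

I am using the following function in a loop to resize them in Python with OpenCV: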

import glob

import cv2
from tqdm import tqdm

def read_image(img_path):
    # print(img_path)
    img = cv2.imread(img_path)
    img = cv2.resize(img, (128, 128))
    return img

for file in tqdm(glob.glob('train-images/*.jpg')):
    img = read_image(file)
    cv2.imwrite(file, img)  # overwrites the original file in place

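But this will take more than 7 hours to complete. I would like to know whether there is any way to speed this up.

Can I use parallel processing, with something like dask, to do this efficiently? If so, how?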

Answer 1

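If you are absolutely set on doing this in Python, please disregard my answer. If you are interested in getting the job done simply and quickly, read on...

I would suggest GNU Parallel if you have lots of things to do in parallel, and even more so as CPUs become "fatter" with more cores rather than "taller" with higher clock rates (GHz).

At its simplest, you can use ImageMagick from the command line on Linux, macOS and Windows to resize a bunch of images like this: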

magick mogrify -resize 128x128\! *.jpg

If you have hundreds of images, you would be better off running this in parallel:

parallel magick mogrify -resize 128x128\! ::: *.jpg

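If you have millions of images, the expansion of *.jpg will overflow your shell's command buffer, so you can use the following to feed the image names in on stdin instead of passing them as parameters: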

find -iname \*.jpg -print0 | parallel -0 -X --eta magick mogrify -resize 128x128\!

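There are two "tricks" here:

I use find ... -print0 together with parallel -0 so that filenames are null-terminated and names containing spaces cause no problems,
I use parallel -X, which means that rather than starting a whole new mogrify process for each image, GNU Parallel works out how many filenames mogrify can accept and passes them to it in batches.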
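For readers who do want to stay in Python, the batching idea behind parallel -X can be sketched roughly as follows. This is only an illustration: the fixed batch size of 500 is an assumption (GNU Parallel derives its batch size from the command-line length limit), the helper names are hypothetical, and the batches here run one after another rather than in parallel.

```python
import subprocess
from itertools import islice


def batches(names, size):
    """Yield successive lists of at most `size` names from any iterable."""
    it = iter(names)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def mogrify_command(batch):
    """Build one mogrify invocation that handles a whole batch of files."""
    # The command is a list handed straight to subprocess, so no shell is
    # involved and the "!" needs no backslash.
    return ["magick", "mogrify", "-resize", "128x128!", *batch]


def resize_in_batches(names, size=500):
    """Run mogrify once per batch instead of once per file (sequentially)."""
    for batch in batches(names, size):
        subprocess.run(mogrify_command(batch), check=True)
```

Calling resize_in_batches with a lazy iterable of paths (for example, from os.scandir) avoids building the full multi-million-entry list in memory.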

I commend both tools to you.

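While the ImageMagick aspects of the above answer work on Windows, I don't use Windows and I am unsure about using GNU Parallel there. I think it may run under Cygwin and/or git-bash. You can try asking a separate question; they are free!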

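As for the ImageMagick part, I think you can get a list of all the JPEG filenames into a file with this command:

DIR /S /B *.JPG > filenames.txt

You can then probably process them (not in parallel) like this:

magick mogrify -resize 128x128\! @filenames.txt

And if you work out how to run GNU Parallel on Windows, you can probably process them in parallel with something like this:

parallel --eta -a filenames.txt magick mogrify -resize 128x128\!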
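The file list can also be produced from Python, which may be convenient where the shell is unfamiliar. The sketch below is an assumption-laden equivalent of the directory listing step: the function name is hypothetical, it matches only the .jpg suffix (case-insensitively), and it writes absolute paths one per line as UTF-8.

```python
from pathlib import Path


def write_filename_list(root, out_file="filenames.txt"):
    """Write the absolute path of every .jpg under root, one per line."""
    count = 0
    with open(out_file, "w", encoding="utf-8") as fh:
        # rglob walks the tree lazily, so the full list is never held in memory.
        for path in Path(root).rglob("*"):
            if path.is_file() and path.suffix.lower() == ".jpg":
                fh.write(f"{path.resolve()}\n")
                count += 1
    return count
```

The function returns the number of paths written, which gives a quick sanity check against the expected file count.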

Answer 2

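If these images are stored on a magnetic hard drive, you may well find that you are limited by read/write speed (many small reads and writes are very slow on spinning disks).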
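One rough way to check whether the disk is the limiting factor is to time plain file reads on a sample of the images, with no decoding or resizing involved. The helper below is a minimal sketch with a hypothetical name; note that the operating system's file cache can make a second run over the same files look much faster than the disk really is.

```python
import time
from pathlib import Path


def read_throughput(paths):
    """Read every file once; return (files_per_second, megabytes_per_second)."""
    start = time.perf_counter()
    count = total_bytes = 0
    for path in paths:
        total_bytes += len(Path(path).read_bytes())
        count += 1
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against division by zero
    return count / elapsed, total_bytes / elapsed / 1e6
```

If the files-per-second figure from a sample of a few thousand images is already close to the rate of the full resize loop, adding more cores is unlikely to help much.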

If the disk is not the bottleneck, you can always throw more cores at the problem by using a pool:

from multiprocessing.dummy import Pool
from multiprocessing.sharedctypes import Value
from ctypes import c_int
import time, cv2, os

wdir = r'C:\folder full of large images'
os.chdir(wdir)

def read_imagecv2(img_path, counter):
    # print(img_path)
    img = cv2.imread(img_path)
    img = cv2.resize(img, (128, 128))
    cv2.imwrite('resized_'+img_path, img) #write the image in the child process (I didn't want to overwrite my images)
    with counter.get_lock(): #processing pools give no way to check up on progress, so we make our own
        counter.value += 1

if __name__ == '__main__':
    # start 4 worker threads (multiprocessing.dummy.Pool is a thread pool, not a process pool)
    with Pool(processes=4) as pool: #this should be the same as your processor cores (or less)
        counter = Value(c_int, 0) #using sharedctypes with mp.dummy isn't needed anymore, but we already wrote the code once...
        chunksize = 4 #making this larger might improve speed (less important the longer a single function call takes)
        result = pool.starmap_async(read_imagecv2, #function to send to the worker pool
                                    ((file, counter) for file in os.listdir(os.getcwd()) if file.endswith('.jpg')),  #generator to fill in function args
                                    chunksize) #how many jobs to submit to each worker at once
        while not result.ready(): #print out progress to indicate program is still working.
            #with counter.get_lock(): #you could lock here but you're not modifying the value, so nothing bad will happen if a write occurs simultaneously
            #just don't `time.sleep()` while you're holding the lock
            print("\rcompleted {} images   ".format(counter.value), end='')
            time.sleep(.5)
        print('\nCompleted all images')

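Because of certain issues that keep cv2 from working well with multiprocessing, we can use threads instead of processes by replacing multiprocessing.Pool with multiprocessing.dummy.Pool. Many OpenCV functions release the GIL anyway, so we should still see a computational benefit from using several cores at once. In addition, threads are lighter than processes, which reduces some of the overhead. After some investigation, I have not found an image library that works well with processes. They all seem to fail when trying to pickle a function to send to the child processes (which is how work items are sent to child processes for computation).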
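The same thread-based approach can also be written with the standard library's concurrent.futures, which some readers may find easier to follow. This is a sketch rather than a drop-in replacement: resize_one and run_in_threads are hypothetical names, the output directory is assumed to exist already, and the speed-up depends on how much of the time OpenCV spends outside the GIL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def resize_one(src, dst_dir, size=(128, 128)):
    """Resize one image into dst_dir; return True on success."""
    import cv2  # imported lazily so run_in_threads stays usable without OpenCV

    img = cv2.imread(str(src))
    if img is None:  # cv2.imread returns None for unreadable files
        return False
    out = Path(dst_dir) / Path(src).name
    return bool(cv2.imwrite(str(out), cv2.resize(img, size)))


def run_in_threads(worker, items, max_workers=4):
    """Apply worker to every item in a thread pool; return (done, failed) counts."""
    done = failed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, item) for item in items]
        for future in as_completed(futures):
            try:
                ok = future.result()
            except Exception:
                ok = False  # count a crashing worker as a failure
            if ok:
                done += 1
            else:
                failed += 1
    return done, failed
```

A call such as run_in_threads(lambda p: resize_one(p, "resized"), Path("train-images").glob("*.jpg")) would process the folder from the question, where "resized" is an assumed output directory. With millions of files, submitting the work in slices would keep the list of pending futures from growing too large.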

Resizing images faster in OpenCV Python
