Question

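I have ~650,000 image files that I convert to numpy arrays with cv2. The images are arranged in subfolders, with ~10k images in each. Each image is tiny: about 600 bytes (2x100 pixels, RGB).

When I read them all using: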

cv2.imread()

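it takes half a second per 10k images and about a minute for all 650k... unless I have restarted the machine. The first time I run the script after a reboot, it takes 20-50 seconds per 10k images, or about half an hour for the full read.

Why?

How can I access them quickly after a reboot, without the extremely slow initial read?

The historic image database grows daily; older images are never rewritten.

Code: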

import os
import time

import cv2
import numpy as np

# ASSET and a1 are defined elsewhere in the script (not shown)
print 'Building historic database...'
elapsed = elapsed2 = time.time()

def get_immediate_subdirectories(a_dir):
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name))]

compare = get_immediate_subdirectories('images_old')
compare.sort()

images = []
for j in compare:
    begin = 1417024800
    end = 1500000000
    if ASSET == j:
        end = int(time.time() - 86400 * 30)
    tally = 0
    # One candidate file name every 7200 seconds (2 hours)
    for i in range(begin, end, 7200):
        im = cv2.imread("images_old/%s/%s_%s.png" % (j, j, i))
        # imread returns None for a missing or unreadable file
        if im is not None:
            images.append([j, i, im.flatten()])
            tally += 1
    print j.ljust(5), ('cv2 imread elapsed: %.2f items: %s' % ((time.time() - elapsed), tally))
    elapsed = time.time()
print '%.2f cv2 imread big data: %s X %s items' % ((time.time() - elapsed2), len(images), len(a1))
elapsed = time.time()

System: AMD FM2+, 16 GB RAM, Linux Mint 17.3, Python 2.7

Answer 1

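I would like to suggest an approach based on REDIS, which is like a database but is really a "data structure server", where the data structures in this case are your 600-byte images. I am not suggesting you rely on REDIS as a permanent storage system. Rather, keep your 650,000 files but cache them in REDIS, which is free and available for Linux, macOS and Windows.

So, basically, at any time of day you like, you could copy your images into REDIS, ready for the next reboot.

I don't speak Python, but here is a Perl script I used to generate 650,000 images of 600 random bytes each and insert them into a REDIS hash. The equivalent Python would be easy to write: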

#!/usr/bin/perl
################################################################################
# generator <number of images> <image size in bytes>
# Mark Setchell
# Generates and sends "images" of specified size to REDIS
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=1;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 2) {
    print "Usage: generator <number of images> <image size in bytes>\n";
    exit 1;
}

my $nimages=$ARGV[0];
my $imsize=$ARGV[1];
my @bytes=(q(a)..q(z),q(A)..q(Z),q(0)..q(9));
my $bl = scalar @bytes;    # number of candidate characters; rand $bl indexes 0..$bl-1

printf "DEBUG: images: $nimages, size: $imsize\n" if $Debug;

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Generate our 600 byte "image"
   my $image;
   for(my $j=0;$j<$imsize;$j++){
      $image .= $bytes[rand $bl];
   }
   # Load it into a REDIS hash called 'im' indexed by an integer number
   $redis->hset('im',$i,$image);
   print "DEBUG: Sending key:im, field:$i, value:$image\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Sent $nimages images of $imsize bytes in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

So you can insert 650,000 images of 600 bytes each into a REDIS hash called "im", indexed by a simple integer [0..649999].

Now, if you stop REDIS and check the size of the database, it is 376MB:

ls -lhrt dump.rdb

-rw-r--r--  1 mark  admin   376M 29 May 20:00 dump.rdb

If you now kill REDIS and restart it, it takes 2.862 seconds to start up and load the 650,000-image database:

redis-server /usr/local/etc/redis.conf

                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 33802
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

33802:M 29 May 20:00:57.698 # Server started, Redis version 3.2.9
33802:M 29 May 20:01:00.560 * DB loaded from disk: 2.862 seconds
33802:M 29 May 20:01:00.560 * The server is now ready to accept connections on port 6379

So you can have REDIS up and running within 3 seconds of a reboot. Then you can query and load the 650,000 images like this:

#!/usr/bin/perl
################################################################################
# reader
# Mark Setchell
# Reads specified number of images from Redis
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=0;    # set to 1 for debug messages
my $nargs = $#ARGV + 1;
if ($nargs != 1) {
    print "Usage: reader <number of images>\n";
    exit 1;
}

my $nimages=$ARGV[0];

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Retrieve image from hash named "im" with key=$i
   my $image = $redis->hget('im',$i);
   print "DEBUG: Received image $i\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Received $nimages images in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

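On my Mac, this reads the 650,000 images of 600 bytes each in 61 seconds, so your total startup time would be about 64 seconds.

Sorry, I don't yet know enough Python to do this in Python, but I suspect the timings would be very similar.

I am basically using a REDIS hash called "im", with hset and hget, and indexing the images by a simple integer. However, REDIS keys are binary-safe, so you could use filenames as keys instead of integers. You can also interact with REDIS from the command line (without Python or Perl), so you can get a list of the 650,000 keys (filenames) with: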

redis-cli <<< "hkeys im"

or retrieve a single image (key/filename="1") with:

 redis-cli <<< "hget 'im' 1"

If you don't have bash, you can do:

echo "hget 'im' 1" | redis-cli

or

echo "hkeys im" | redis-cli

I have just been reading about persisting/serializing Numpy arrays, so that may be an even simpler option than REDIS... see here.
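The answer notes that the Python equivalent should be easy to write, so here is a minimal sketch of what it might look like. It assumes the redis-py client, whose `hset(name, key, value)` and `hget(name, key)` methods correspond to the Perl calls above. The helper names `cache_files` and `fetch_cached` are hypothetical, and file paths are used as hash fields rather than integers, as the answer suggests is possible.

```python
def cache_files(client, paths, hash_name="im"):
    """Store the raw bytes of each file in a hash, keyed by file path."""
    count = 0
    for path in paths:
        with open(path, "rb") as fh:
            client.hset(hash_name, path, fh.read())
        count += 1
    return count


def fetch_cached(client, paths, hash_name="im"):
    """Return {path: raw bytes} for every path present in the hash."""
    found = {}
    for path in paths:
        data = client.hget(hash_name, path)
        if data is not None:  # hget returns None for a missing field
            found[path] = data
    return found
```

With redis-py installed and a server listening on the default port, `redis.Redis()` would be passed as `client`. The values that come back are still PNG-encoded bytes, so they would need decoding before use.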

Answer 2

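I was thinking about this overnight, and there is a simpler, faster solution...

Basically, at any point during the day that you like, you parse the filesystem of your existing image files and make a flattened representation of them in two files. Then, when you start up, you just read the flattened representation, which is a single contiguous 300MB file on disk that can be read in 2-3 seconds.

So, the first file is called "flat.txt" and contains one line per image file, like this, but actually 650,000 lines long: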

filename:width:height:size
filename:width:height:size
...
filename:width:height:size

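The second file is just a binary file with the contents of each listed file appended to it, so it is one contiguous 360 MB binary file called "flat.bin".

Here is how I create the two files in Perl, using a script called flattener.pl: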

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;

# Names of the index and bin files
my $idxname="flat.txt";
my $binname="flat.bin";

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx,'>',$idxname) or die "Cannot open $idxname: $!";

# Open binary file - simply all images concatenated
open(my $bin,'>',$binname) or die "Cannot open $binname: $!";

# Save time we started parsing filesystem
my $atime = my $mtime = time;

find(sub {
  # Only parse actual files (not directories) with extension "png"
  if (-f and /\.png$/) {
    # Get full path filename, filesize in bytes
    my $path   = $File::Find::name;
    my $nbytes = -s;
    # Write name and vital statistics to index file
    print $idx "$path:100:2:$nbytes\n";
    # Slurp entire file and append to binary file
    my $image = do {
       local $/ = undef;
       open my $fh, "<", $path;
       <$fh>;
    };
    print $bin $image;
  }
}, '/path/to/top/directory');

close($idx);
close($bin);

# Set atime and mtime of index file to match time we started parsing
utime($atime, $mtime, $idxname) or warn "Couldn't touch $idxname: $!";

Then, when you want to start up, you run loader.pl, which looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx, '<', 'flat.txt') or die "Cannot open flat.txt: $!";

# Open binary file - simply all images concatenated
open(my $bin, '<', 'flat.bin') or die "Cannot open flat.bin: $!";

# Read index file, one line at a time
my $total=0;
my $nfiles=0;
while ( my $line = <$idx> ) {
    # Remove CR or LF from end of line
    chomp $line;

    # Parse line into: filename, width, height and size
    my ($name,$width,$height,$size) = split(":",$line);

    print "Reading file: $name, $width x $height, bytes:$size\n";
    my $bytes_read = read $bin, my $bytes, $size;
    if($bytes_read != $size){
       print "ERROR: File=$name, expected size=$size, actually read=$bytes_read\n"
    }
    $total += $bytes_read;
    $nfiles++;
}
print "Read $nfiles files, and $total bytes\n";

close($idx);
close($bin);

This takes under 3 seconds with 497,000 files of 600 bytes each.
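For readers working in Python, the two Perl scripts could be sketched roughly as follows. The file names, the hard-coded 100 x 2 dimensions and the back-dating of the index file are taken from flattener.pl; the function names `flatten` and `load` are hypothetical, and this is an untimed sketch rather than a benchmarked port.

```python
import os
import time


def flatten(top_dir, idx_name="flat.txt", bin_name="flat.bin"):
    """Concatenate every .png under top_dir into bin_name, indexed by idx_name."""
    started = time.time()  # recorded before scanning, as in flattener.pl
    count = 0
    with open(idx_name, "w") as idx, open(bin_name, "wb") as out:
        for root, _dirs, files in os.walk(top_dir):
            for name in sorted(files):
                if not name.endswith(".png"):
                    continue
                path = os.path.join(root, name)
                with open(path, "rb") as fh:
                    data = fh.read()
                # Width 100 and height 2 are hard-coded, as in the Perl script
                idx.write("%s:100:2:%d\n" % (path, len(data)))
                out.write(data)
                count += 1
    # Back-date the index so files added during the scan look newer than it
    os.utime(idx_name, (started, started))
    return count


def load(idx_name="flat.txt", bin_name="flat.bin"):
    """Return a list of (path, width, height, raw_bytes) in index order."""
    records = []
    with open(idx_name) as idx, open(bin_name, "rb") as blob:
        for line in idx:
            # rsplit keeps any ':' inside the path intact
            path, width, height, size = line.rstrip("\n").rsplit(":", 3)
            data = blob.read(int(size))
            if len(data) != int(size):
                raise IOError("short read for %s" % path)
            records.append((path, int(width), int(height), data))
    return records
```

Calling `flatten("images_old")` once a day and `load()` at startup would mirror the workflow described in this answer.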

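So, what about files that have changed since you ran the flattener.pl script? Well, at the start of flattener.pl, I get the system time in seconds since the epoch. Then, at the end, once I have finished parsing the 650,000 files and written out the flattened files, I set the index file's modification time back to just before I started parsing. Then, in your code, all you need to do is load the files using the loader.pl script, do a quick find for all image files newer than the index file, and load those few extra files using your existing method.

In bash, that would be: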

find . -newer flat.txt -print
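The same check could be sketched in Python by comparing modification times against the index file. The function name `newer_than` is hypothetical, and the `.png` filter is an addition that the `find` command above does not apply.

```python
import os


def newer_than(index_path, top_dir, suffix=".png"):
    """Return paths under top_dir modified more recently than index_path."""
    cutoff = os.path.getmtime(index_path)
    newer = []
    for root, _dirs, files in os.walk(top_dir):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(suffix) and os.path.getmtime(path) > cutoff:
                newer.append(path)
    return sorted(newer)
```

The paths returned by `newer_than("flat.txt", "images_old")` are the ones that would still need to go through the existing `cv2.imread()` loop.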

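As you are reading the images with OpenCV, you will need to run imdecode() on the raw file data, so I would benchmark whether you want to do that when flattening or when loading.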

Again, sorry it is in Perl, but I am sure the same can be done in Python.

Answer 3

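Have you checked that the disk is not the bottleneck? After the first read, the OS can cache the image files and then serve them from memory. If all your files together are large enough (10-20 GB), reading them from a slow hard disk could take several minutes.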
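One way to test this hypothesis is to time the same raw read twice in a row: if the second pass is far faster, the data is most likely coming from the OS cache rather than the disk. A minimal sketch (the function name `time_reads` is hypothetical) that leaves image decoding out, so that only file access is measured:

```python
import time


def time_reads(paths):
    """Read every file once; return (elapsed_seconds, total_bytes)."""
    start = time.time()
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += len(fh.read())
    return time.time() - start, total
```

Running `time_reads(paths)` straight after a reboot and then again immediately would give the cold and warm figures. On Linux, the cache can also be emptied without rebooting by writing to /proc/sys/vm/drop_caches as root.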

Answer 4

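Have you tried data parallelism on your for j in compare: loop to alleviate the disk access bottleneck? multiprocessing can be used to run one task per CPU core (or hardware thread). For an example, see using-multiprocessing-queue-pool-and-locking.

If you have, say, an Intel i7 with 8 virtual cores, the elapsed time could in theory drop to 1/8. The actual reduction also depends on the access time of your HDD or SSD, the SATA interface type, and so on.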
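A rough sketch of one task per subdirectory is shown below. Note the assumption: it uses `multiprocessing.pool.ThreadPool`, which runs threads rather than processes, so that the example runs without pickling concerns. `multiprocessing.Pool` offers the same `map` interface and would give one worker process per core, as this answer describes. The function names `read_dir` and `read_all` are hypothetical.

```python
import os
from multiprocessing.pool import ThreadPool


def read_dir(dir_path):
    """Read every .png in one subdirectory; return a list of (path, raw bytes)."""
    out = []
    for name in sorted(os.listdir(dir_path)):
        if name.endswith(".png"):
            path = os.path.join(dir_path, name)
            with open(path, "rb") as fh:
                out.append((path, fh.read()))
    return out


def read_all(top_dir, workers=8):
    """Read each immediate subdirectory of top_dir as a separate task."""
    subdirs = sorted(
        os.path.join(top_dir, d)
        for d in os.listdir(top_dir)
        if os.path.isdir(os.path.join(top_dir, d))
    )
    pool = ThreadPool(workers)
    try:
        per_dir = pool.map(read_dir, subdirs)  # results keep the subdirs order
    finally:
        pool.close()
        pool.join()
    return [item for chunk in per_dir for item in chunk]
```

Whether this helps on a cold cache depends on the storage: a spinning disk may gain little from concurrent readers, so it is worth measuring before relying on it.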

How to make OpenCV cv2.imread faster in Python after a reboot
