Question

I am trying to feed 1D numpy arrays (flattened images) into an H5py data file via a generator, in order to build training and validation matrices.

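The following code was adapted from a solution (which I can't find now) in which the data argument of the create_dataset method of H5py's File object is supplied by a call to np.fromiter, which takes a generator as one of its arguments.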

from scipy.misc import imread
import h5py
import numpy as np
import os

# Creating h5 data file
f = h5py.File('../data.h5', 'w')

# Source directory for image data
src = '/datasets/aic540/train/images/'

# Showing quantity and dimensionality of data
images = os.listdir(src)
ex_img = imread(src + images[0])
flat_img = ex_img.flatten()
print "# of images is {}".format(len(images))
print "image shape is {}".format(ex_img.shape)
print "flattened image shape is {}".format(flat_img.shape)

# Creating generator to feed in data to h5py's `create_dataset` function
gen = (imread(src + i).flatten().astype(np.int8) for i in os.listdir(src))

# Creating h5 dataset
f.create_dataset(name='training',
                 #shape=(59482, 1555200),
                 data=np.fromiter(gen, dtype=np.int8))

Output:

# of images is 59482
image shape is (540, 960, 3)
flattened image shape is (1555200,)
Traceback (most recent call last):
  File "process_images.py", line 30, in <module>
    data=np.fromiter(gen, dtype=np.int8))
ValueError: setting an array element with a sequence.

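While searching for this error in this context, I read that the problem is that np.fromiter() needs a list rather than a generator (which seems at odds with what the name "fromiter" implies). Wrapping the generator in a list call, list(gen), lets the code run, but of course expanding that list uses up all the memory before create_dataset is ever called.

How do I use a generator to feed data into an H5py data file?

If my approach is entirely wrong, what is the correct way to build a very large numpy matrix that doesn't fit in memory, whether with H5py or some other method?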

Answer 1

The "with a sequence" error comes from what you are trying to feed fromiter, not from the generator part.

In py3, range behaves like a generator here (strictly speaking it is a lazy iterable), for example:

In [15]: np.fromiter(range(3),dtype=int)
Out[15]: array([0, 1, 2])
In [16]: np.fromiter((2*x for x in range(3)),dtype=int)
Out[16]: array([0, 2, 4])

But if I start with a 2d array (which is what imread produces, right?) and create a generator expression as you do:

In [17]: gen = (np.ones((2,3)).flatten().astype(np.int8) for i in range(3))
In [18]: list(gen)
Out[18]: 
[array([1, 1, 1, 1, 1, 1], dtype=int8),
 array([1, 1, 1, 1, 1, 1], dtype=int8),
 array([1, 1, 1, 1, 1, 1], dtype=int8)]

I get a list of arrays. Passing the same generator to fromiter reproduces your error:

In [19]: gen = (np.ones((2,3)).flatten().astype(np.int8) for i in range(3))
In [21]: np.fromiter(gen, np.int8)
...
ValueError: setting an array element with a sequence.

np.fromiter creates a 1d array from an iterable that provides "numbers" one at a time, not from something that dishes out lists or arrays.

In any case, np.fromiter creates a full array in memory, not some sort of generator. There is no such thing as an array "generator".
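To illustrate the point about scalars, here is a minimal sketch of one way to make fromiter accept this kind of data: flatten the stream of arrays into a stream of numbers with itertools.chain.from_iterable. The fake_images helper and the tiny (2, 3) shape are stand-ins I made up for the real imread calls. Note that this still builds the whole result in memory, so it does not solve the size problem.

```python
import itertools

import numpy as np


def fake_images(n, shape=(2, 3)):
    """Hypothetical stand-in for loading n images from disk."""
    for i in range(n):
        yield np.full(shape, i, dtype=np.int8)


# chain.from_iterable walks each flattened array element by element,
# so fromiter receives a stream of scalars instead of a stream of arrays.
scalars = itertools.chain.from_iterable(img.ravel() for img in fake_images(3))
flat = np.fromiter(scalars, dtype=np.int8)

# Restore the "one row per image" layout.
rows = flat.reshape(3, -1)
print(rows.shape)
```

Iterating element by element in Python is slow for million-pixel images, so treat this as a demonstration of what fromiter expects rather than as a recommended loading strategy.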

Even without chunking, you can write data to the file by "row" or by any other slice.

In [28]: f = h5py.File('test.h5', 'w')
In [29]: data = f.create_dataset(name='test',shape=(100,10))
In [30]: for i in range(100):
    ...:     data[i,:] = np.arange(i,i+10)
    ...:     
In [31]: data
Out[31]: <HDF5 dataset "test": shape (100, 10), type "<f4">

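In your case, the equivalent is to load an image, reshape it, and write it to the h5py dataset immediately. There is no need to collect all the images in an array or list.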
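As a rough sketch of that loop: the dataset is created with its final shape up front, and a generator hands over one image at a time. At one byte per value, 59482 rows of 1555200 values come to roughly 92.5 GB, which is why nothing here holds more than one image in memory. The iter_images helper, the small sizes, and the temporary file path are assumptions for illustration; in real code the generator would call your image reader. I also use uint8 rather than int8, on the assumption that the pixel values run from 0 to 255.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_images(n_images, shape):
    """Hypothetical stand-in for reading image files one at a time."""
    for i in range(n_images):
        yield np.full(shape, i % 256, dtype=np.uint8)


n_images, shape = 5, (4, 6, 3)   # small placeholder sizes
row_len = int(np.prod(shape))    # length of one flattened image
path = os.path.join(tempfile.mkdtemp(), "data.h5")

with h5py.File(path, "w") as f:
    # Allocate the full dataset on disk; no data is held in memory yet.
    dset = f.create_dataset("training", shape=(n_images, row_len), dtype=np.uint8)
    for i, img in enumerate(iter_images(n_images, shape)):
        dset[i, :] = img.ravel()  # write one flattened image per row

# Read back only a slice to check the result.
with h5py.File(path, "r") as f:
    first_rows = f["training"][:2, :]
print(first_rows.shape)
```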

Reading 10 rows:

In [33]: data[:10,:]
Out[33]: 
array([[  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.],
       [  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.],
       [  2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.],
       [  3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.],
       [  4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.],
       [  5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.],
       [  6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.],
       [  7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.],
       [  8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.],
       [  9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.]], dtype=float32)

Enabling chunking might help with really large datasets, but I don't have experience in that area.
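For what it's worth, here is a hedged sketch of what a chunked, growable dataset can look like, using the chunks and maxshape arguments of create_dataset and the dataset's resize method. The one-row-per-chunk layout, the iter_rows helper, and the sizes are my assumptions, not something tested on data of your scale; the right chunk shape depends on how you read the data back.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_rows(n_rows, row_len):
    """Hypothetical stand-in for a generator of flattened images."""
    for i in range(n_rows):
        yield np.full(row_len, i % 256, dtype=np.uint8)


row_len = 72
path = os.path.join(tempfile.mkdtemp(), "chunked.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "training",
        shape=(0, row_len),         # start empty
        maxshape=(None, row_len),   # unlimited along the first axis
        chunks=(1, row_len),        # assumed layout: one image per chunk
        dtype=np.uint8,
    )
    for row in iter_rows(7, row_len):
        n = dset.shape[0]
        dset.resize(n + 1, axis=0)  # grow by one row
        dset[n, :] = row            # then fill the new row
    final_shape = dset.shape
    stored_chunks = dset.chunks
print(final_shape, stored_chunks)
```

This variant is handy when the number of images is not known in advance. If the count is known, creating the dataset at full size and filling it row by row avoids the repeated resize calls.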
