Question

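I have a large amount of data that is generated on HDFS. Can Keras read HDFS files directly? Until now I have copied the data to local disk and read it from there as ordinary files, but that costs both time and storage space.

I tried copying the HDFS files to local disk first, but that takes a long time.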

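import codecs

import numpy as np

# batch_size, text1_maxlen, text2_maxlen and class_num are defined elsewhere.
def generator_array_from_file(path, word2ID):
    X1 = np.zeros((batch_size, text1_maxlen), dtype=np.int32)
    X1_len = np.zeros((batch_size,), dtype=np.int32)
    X2 = np.zeros((batch_size, text2_maxlen), dtype=np.int32)
    X2_len = np.zeros((batch_size,), dtype=np.int32)
    Y = np.zeros((batch_size, class_num), dtype=np.int32)
    count = 0
    while True:
        # Re-open the file on every pass so the generator never runs dry.
        fts = codecs.open(path, 'r', "utf-8")
        for line in fts:
            ...  # loop body omitted in the original post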

train_generator = generator_array_from_file(train_set,wordID)
history = model.fit_generator(train_generator)

Is it possible to read the HDFS file directly?

Answer 1

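Yes. For this there is the HDF5Matrix class, described here, which emulates a numpy array interface on top of an HDF5 file. You only need to create an instance of this class with the HDF5 file name and the name of the dataset inside that file: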

from keras.utils import HDF5Matrix

X = HDF5Matrix('file.hdf5', 'data')
y = HDF5Matrix('file.hdf5', 'labels')

model.fit(X, y, epochs=..., batch_size=...)
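
Note that HDF5Matrix reads HDF5 files, which is a file format and not the same thing as HDFS (the Hadoop Distributed File System) that the question asks about. If the data really sits on HDFS, one possible approach is to stream it line by line through the `hdfs dfs -cat` command instead of copying it to local disk first. The following is only a minimal sketch: it assumes the `hdfs` command-line client is installed and configured on the training machine, and the helper names `stream_lines` and `hdfs_line_generator` are made up for illustration.

```python
import subprocess


def stream_lines(cmd, encoding="utf-8"):
    """Yield decoded lines from the stdout of a command, one at a time."""
    # Leaving the `with` block closes the pipe and waits for the process.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        for raw in proc.stdout:
            yield raw.decode(encoding).rstrip("\r\n")
    if proc.returncode != 0:
        raise RuntimeError("command failed with exit code %d" % proc.returncode)


def hdfs_line_generator(path):
    """Loop over a file on HDFS forever, re-reading it on every pass.

    Assumption: the `hdfs` CLI is on PATH; `hdfs dfs -cat` prints the
    file contents to stdout, so nothing is written to local disk.
    """
    while True:
        yield from stream_lines(["hdfs", "dfs", "-cat", path])
```

Inside the question's generator_array_from_file, the `for line in fts:` loop could then iterate over `stream_lines(["hdfs", "dfs", "-cat", path])` instead of a locally opened file, with the batch-building logic left unchanged.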
