Multiple sets of inputs with large numbers of different samples cause a memory error

Date: 2019-07-14 10:19:38

Tags: python numpy tensorflow keras

I am creating a deep learning model in Keras that has two sets of input data: input1 and input2.

input1 has M samples, while input2 has N samples. The two sets of inputs go through different DNN layers before being concatenated together. The original output data has M*N samples.

What I want is for each input-output pair to be:

([input1[i], input2[j]], output[i,j])

I know that in keras model.fit I have to pass inputs with equal numbers of samples. However, since both of my inputs are large, repeating input1 N times or input2 M times to make their sample counts equal causes a memory error.

I don't know whether there is a way in keras or tensorflow to teach the model to fit on sample pairs as described, without repeating the inputs.
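
For illustration, the naive repeating approach might look roughly like this (a sketch only; np.repeat/np.tile and the variable names are assumed, not from the question):

import numpy as np

M, N, lat_dim = 42, 123, 12
input1 = np.random.rand(M, lat_dim)
input2 = np.random.rand(N, lat_dim)

# repeat each row of input1 N times and stack input2 M times so that every
# (i, j) pair appears once; both arrays grow to M*N rows, which exhausts
# memory when M and N are large
input1_rep = np.repeat(input1, N, axis=0)  # shape (M*N, lat_dim)
input2_rep = np.tile(input2, (M, 1))       # shape (M*N, lat_dim)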

1 Answer:

Answer 0: (score: 0)

Good question. The usual way to solve this is a data generator: a generator object that builds each batch of data on the fly during training, which saves memory.

Here is a minimal working example of a data generator based on dummy data:

# (notebook command) pin the TensorFlow version used in this answer
!pip install tensorflow-gpu==1.14.0
import numpy as np

import tensorflow as tf
from tensorflow import keras

# creating dummy data

M = 42
N = 123

lat_dim = 12

x_data_1 = np.random.rand(M,lat_dim)
x_data_2 = np.random.rand(N,lat_dim)

# flat targets: y_data[m*N + n] corresponds to the pair (x_data_1[m], x_data_2[n])
y_data   = np.random.rand(N*M)

# creating data generator

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(
        self,
        x_data_1,
        x_data_2,
        batch_size=32,
        shuffle=True
    ):
        'Initialization'
        self.x_data_1 = x_data_1
        self.x_data_2 = x_data_2
        self.batch_size = batch_size
        self.shuffle = shuffle

        self.M = self.x_data_1.shape[0]
        self.N = self.x_data_2.shape[0]
        # trigger on_epoch_end() once for initialization
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(self.M * self.N / self.batch_size))

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        # create list from index to m and n
        self.index_to_m_and_n = [
            (m,n) for m in range(self.M) for n in range(self.N)
        ]

        self.index_to_m_and_n = [
            (index,mn_tuple) for index,mn_tuple in enumerate(self.index_to_m_and_n)
        ]

        if self.shuffle:
            np.random.shuffle(self.index_to_m_and_n)

    def __getitem__(self, batch_index):
        'Generate one batch of data'

        # Generate indexes of the batch. Index can be between 0 and
        # DataGenerator.__len__ - 1
        batch_index_to_m_and_n = self.index_to_m_and_n[
            batch_index*self.batch_size:(batch_index+1)*self.batch_size
        ]

        # Generate data
        X, Y = self.__data_generation(batch_index_to_m_and_n)

        # return data
        return X, Y

    def __data_generation(self, batch_index_to_m_and_n):
        'Generates data based on batch indices'
        X1 = self.x_data_1[
            [ind_to_mn[1][0] for ind_to_mn in batch_index_to_m_and_n]
        ]

        X2 = self.x_data_2[
            [ind_to_mn[1][1] for ind_to_mn in batch_index_to_m_and_n]
        ]

        # NOTE: y_data is read from the enclosing module scope here;
        # ind_to_mn[0] is the flat index m*N + n into the targets
        Y = np.array([
            y_data[ind_to_mn[0]]
            for ind_to_mn in batch_index_to_m_and_n
        ])

        return [X1,X2], Y

# test data generator
my_data_generator = DataGenerator(
    x_data_1,
    x_data_2,
    batch_size = 32,
    shuffle = True
)

# indexing a keras.utils.Sequence calls __getitem__ under the hood
[X1,X2],Y = my_data_generator[5]
print(
    'data shapes:',
    X1.shape,
    X2.shape,
    Y.shape
)

The output is:

data shapes: (32, 12) (32, 12) (32,)

As you can see, all the generator does is keep a list of the form [ij, [i, j]]. The input data for your model, in the form [x_data_1, x_data_2], y_data, is only assembled for a single batch when __getitem__() is called.
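
To make that bookkeeping concrete, here is what index_to_m_and_n looks like for a tiny case (M = 2, N = 3, shuffle disabled):

M, N = 2, 3
pairs = [(m, n) for m in range(M) for n in range(N)]
index_to_m_and_n = list(enumerate(pairs))
print(index_to_m_and_n)
# [(0, (0, 0)), (1, (0, 1)), (2, (0, 2)), (3, (1, 0)), (4, (1, 1)), (5, (1, 2))]
# the flat index ij = m*N + n selects y_data[ij] for the pair (x_data_1[m], x_data_2[n])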

You can use the data generator with the model.fit_generator() method instead of model.fit() as you normally would, as shown with this example model:

# create simple model

x1 = keras.layers.Input(shape=(lat_dim,))
x2 = keras.layers.Input(shape=(lat_dim,))

x  = keras.layers.concatenate([x1,x2])
fx = keras.layers.Dense(1)(x)

model = keras.models.Model(inputs=[x1,x2], outputs=fx)

model.compile(loss='mean_squared_error', optimizer='sgd')

print(model.summary())

model.fit_generator(
    generator = DataGenerator(
        x_data_1,
        x_data_2,
        batch_size = 32,
        shuffle = True
    ),
    epochs = 5
)

The output is:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 12)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 12)]         0                                            
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 24)           0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 1)            25          concatenate[0][0]                
==================================================================================================
Total params: 25
Trainable params: 25
Non-trainable params: 0
__________________________________________________________________________________________________
None
Epoch 1/5
161/161 [==============================] - 1s 3ms/step - loss: 0.2074
Epoch 2/5
161/161 [==============================] - 0s 3ms/step - loss: 0.1413
Epoch 3/5
161/161 [==============================] - 0s 3ms/step - loss: 0.1197
Epoch 4/5
161/161 [==============================] - 0s 3ms/step - loss: 0.1071
Epoch 5/5
161/161 [==============================] - 0s 3ms/step - loss: 0.0994

In every epoch, the .fit_generator() method automatically calls __getitem__(self, batch_index) for every batch_index between 0 and __len__(self) - 1, so the entire dataset is used. After each epoch, on_epoch_end(self) is called and the index list is shuffled again.
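
As a side note: in newer TensorFlow versions (2.1 and later), fit_generator() is deprecated and model.fit() accepts a keras.utils.Sequence directly, so under that assumption the training call simplifies to:

# TF >= 2.1 (assumed): model.fit() consumes the Sequence directly
model.fit(
    DataGenerator(x_data_1, x_data_2, batch_size=32, shuffle=True),
    epochs=5
)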