I am running a simple autoencoder on a very large time-series dataset. Specifically: the training set consists of 500000 time series, and the validation and test sets of 100000 time series each. Each time series has 5994 time components.
I am running the training of the autoencoder on a cluster that I can access via ssh. The cluster is equipped with GPUs, so I installed tensorflow-gpu to take advantage of GPU acceleration. The cluster has 2 Nvidia Tesla K80 cards.
The problem I am facing is that the dataset always seems to be too large for the cluster to handle: I keep getting "MemoryError"s in the output, even when requesting large amounts of memory in the job script I submit to the SLURM scheduler.
I started to consider training the network with gradient accumulation to work around this. However, I have no experience with it, and I am struggling to understand which modifications my code would need.
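From the examples I have found so far, my understanding is that gradient accumulation replaces model.fit with a custom training loop that sums the gradients of several small batches and applies them in a single optimizer step. Here is a minimal sketch of the idea on a toy model (the layer size, micro_batch and accum_steps values are placeholders, not my real settings):

import numpy as np
import tensorflow as tf

# Toy stand-ins for the real data and model (shapes shrunk for the example).
x = np.random.randn(256, 64).astype(np.float32)
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(64, activation='tanh', input_shape=(64,))])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

micro_batch = 32   # what actually fits in memory at once
accum_steps = 4    # gradients summed over this many micro-batches
                   # -> effective batch size 32 * 4 = 128

# one gradient accumulator per trainable weight, initialised to zero
accum = [tf.Variable(tf.zeros_like(w), trainable=False)
         for w in model.trainable_variables]

for step in range(0, len(x), micro_batch):
    xb = x[step:step + micro_batch]
    with tf.GradientTape() as tape:
        pred = model(xb, training=True)
        # divide so that the summed gradients average over the effective batch
        loss = loss_fn(xb, pred) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        a.assign_add(g)
    # apply and reset once enough micro-batches have been accumulated
    if (step // micro_batch + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        for a in accum:
            a.assign(tf.zeros_like(a))

Is this the right direction, and how would it map onto my fit call?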
Below is my AE code in Keras (with the tensorflow-gpu backend). What changes would be required to implement gradient accumulation?
from __future__ import division
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Flatten, Lambda, Activation, Conv1D, MaxPooling1D, UpSampling1D, Reshape, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers
from tensorflow.keras import backend as K  # use the tf.keras backend, not standalone keras
from tensorflow.keras.layers import Add
import tensorflow as tf
import scipy.io
import sys
import matplotlib.pyplot as plt
import numpy as np
import copy
import h5py
import random
####################
### DATA READING ###
####################
# reading training
train = np.load('./training_0.npy')
for i in range(1, 10):
    train_2 = np.load('./training_{}.npy'.format(i))
    train = np.vstack((train, train_2))
print('training shape', np.shape(train)) # (500000, 5994)
# reading validation
val = np.load('./training_10.npy')
val_2 = np.load('./training_11.npy')
val = np.vstack((val, val_2))
print('validation shape', np.shape(val)) # (100000, 5994)
# reading testing
test = np.load('./training_12.npy')
test_2 = np.load('./training_13.npy')
test = np.vstack((test, test_2))
print('testing shape', np.shape(test)) # (100000, 5994)
# n. time components
number_of_time_components = np.shape(train)[1] # 5994
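# NOTE: if the .npy files store float64, the stacked training array above is
# 500000 * 5994 * 8 bytes ~= 24 GB on its own; if float32 precision is
# acceptable, casting it (train = train.astype(np.float32)) would halve that.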
###################
### AUTOENCODER ###
###################
# network parameters
Dense_1 = 2048
Dense_2 = 1024
Dense_3 = 512
Dense_4 = 256
# training parameters
epochs = 20
filedescriptor = 'AE_run'  # tag used in the names of the saved figures
def Encoder():
    encoder_input = Input(batch_shape=(None, number_of_time_components))
    encoded = Dense(Dense_1, activation='tanh')(encoder_input)
    encoded = Dense(Dense_2, activation='tanh')(encoded)
    encoded = Dense(Dense_3, activation='tanh')(encoded)
    encoded = Dense(Dense_4, activation='tanh')(encoded)
    return Model(encoder_input, encoded)
def DecoderAE(encoder_input, encoded_input):
    decoded_3 = Dense(Dense_3, activation='tanh', name='dec_3')(encoded_input)
    decoded_2 = Dense(Dense_2, activation='tanh', name='dec_2')(decoded_3)
    decoded_1 = Dense(Dense_1, activation='tanh', name='dec_1')(decoded_2)
    decoded = Dense(number_of_time_components, activation='tanh', name='dec_out')(decoded_1)
    return Model(encoder_input, decoded)
encoder = Encoder()
AE = DecoderAE(encoder.input, encoder.output)
AE.compile(optimizer='adam', loss='mse')
# train the model
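# NOTE: a single batch of 50000 series is 50000 * 5994 * 4 bytes ~= 1.2 GB
# in float32, before any layer activations are allocated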
history = AE.fit(x=train, y=train,
                 epochs=epochs,
                 batch_size=50000,
                 validation_data=(val, val))
# plot loss
loss = history.history['loss']
val_loss = history.history['val_loss']
range_epochs = range(epochs)
plt.figure()
plt.plot(range_epochs, loss, 'bo', label='Training loss')
plt.plot(range_epochs, val_loss, 'b', label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and validation loss')
plt.legend()
plt.savefig('IMAGES/loss_{}.pdf'.format(filedescriptor))
plt.close()
# predictions for testing
prediction = AE.predict(test)
# plot testing predictions vs original
fig, ax = plt.subplots(5, figsize=(15,30))
for i in range(5):
    ax[i].plot(test[10000*i], color='blue', label='Original')
    ax[i].plot(prediction[10000*i], color='red', label='AE encoded-decoded')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(500000+10000*i+1), fontsize='x-large')
    ax[i].legend(fontsize='large')
plt.subplots_adjust(hspace=1)
fig.savefig('IMAGES/COMPRESSION_test_{}.pdf'.format(filedescriptor))
plt.close()
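Independently of gradient accumulation, I also wondered whether the MemoryError could be avoided by streaming the files with tf.data instead of stacking everything into one array with np.vstack. A sketch of what I mean, assuming the same training_*.npy files (the batch size of 256 is a placeholder, and I have not tested this on the cluster):

import numpy as np
import tensorflow as tf

def training_rows():
    # memory-map each file so that rows are read from disk lazily,
    # instead of stacking all ten files into one ~24 GB array
    for i in range(10):
        chunk = np.load('./training_{}.npy'.format(i), mmap_mode='r')
        for row in chunk:
            r = np.asarray(row, dtype=np.float32)
            yield r, r  # autoencoder: input and target are the same

dataset = (tf.data.Dataset.from_generator(
               training_rows,
               output_types=(tf.float32, tf.float32),
               output_shapes=((5994,), (5994,)))
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))

# then: history = AE.fit(dataset, epochs=epochs)

Would this be preferable to gradient accumulation here, or do the two address different problems?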
EDIT
To answer the questions in the comments, here is the latest error I got, which is a segmentation fault. Note that the 23976000000-byte allocation in the warnings below is exactly 500000 * 5994 * 8 bytes, i.e. a float64 copy of the full training set:
Skipping registering GPU devices...
2020-03-17 11:59:38.147624: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-17 11:59:38.155618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300180000 Hz
2020-03-17 11:59:38.155743: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a0c00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.155761: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-17 11:59:38.203600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a34e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.203631: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-03-17 11:59:38.203714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-17 11:59:38.203724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]
2020-03-17 11:59:39.235137: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
2020-03-17 11:59:55.898563: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
containerscript.sh: line 4: 3654 Segmentation fault python Autoencoder.py
srun: error: compute-gpu-0-0: task 0: Exited with exit code 139
And this is the job script used to launch the job on SLURM:
#!/bin/bash
#SBATCH -p GPU
#SBATCH -N1
#SBATCH -n1
#SBATCH --mem=100000
#SBATCH --gres=gpu:k80:1
./containerscript.sh
containerscript.sh contains only:
python Autoencoder.py