I am running a simple autoencoder on a very large time-series dataset. Specifically: the training set consists of 500000 time series, and the validation and test sets of 100000 time series each. Each time series has 5994 time components.
I am running the training of the autoencoder on a cluster that I can access via ssh. The cluster is equipped with GPUs, so I installed tensorflow-gpu to take advantage of GPU acceleration. The cluster has 2 Nvidia Tesla K80 cards.
The problem I am facing is that the dataset always seems to be too large for the cluster to handle: I keep getting "MemoryError"s in the output, even when requesting large amounts of memory in the job script I submit to the SLURM scheduler.
I started to consider training the network with gradient accumulation to work around this. However, I have no experience with it, and I am struggling to understand which modifications my code would need.
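From the examples I have found so far, my understanding is that gradient accumulation replaces model.fit with a custom training loop that sums the gradients of several small batches and applies them in a single optimizer step. Here is a minimal sketch of the idea on a toy model (the layer size, micro_batch and accum_steps values are placeholders, not my real settings):

import numpy as np
import tensorflow as tf

# Toy stand-ins for the real data and model (shapes shrunk for the example).
x = np.random.randn(256, 64).astype(np.float32)
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(64, activation='tanh', input_shape=(64,))])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

micro_batch = 32   # what actually fits in memory at once
accum_steps = 4    # gradients summed over this many micro-batches
                   # -> effective batch size 32 * 4 = 128

# one gradient accumulator per trainable weight, initialised to zero
accum = [tf.Variable(tf.zeros_like(w), trainable=False)
         for w in model.trainable_variables]

for step in range(0, len(x), micro_batch):
    xb = x[step:step + micro_batch]
    with tf.GradientTape() as tape:
        pred = model(xb, training=True)
        # divide so that the summed gradients average over the effective batch
        loss = loss_fn(xb, pred) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        a.assign_add(g)
    # apply and reset once enough micro-batches have been accumulated
    if (step // micro_batch + 1) % accum_steps == 0:
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        for a in accum:
            a.assign(tf.zeros_like(a))

Is this the right direction, and how would it map onto my fit call?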
Below is my AE code in Keras (with the tensorflow-gpu backend). What changes would be required to implement gradient accumulation?
from __future__ import division
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Flatten, Lambda, Activation, Conv1D, MaxPooling1D, UpSampling1D, Reshape, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers
from tensorflow.keras import backend as K  # use the tf.keras backend, not standalone keras
from tensorflow.keras.layers import Add
import tensorflow as tf
import scipy.io
import sys
import matplotlib.pyplot as plt
import numpy as np
import copy
import h5py
import random
####################
### DATA READING ###
####################
# reading training
train = np.load('./training_0.npy')
for i in range(1, 10):
    train_2 = np.load('./training_{}.npy'.format(i))
    train = np.vstack((train, train_2))
print('training shape', np.shape(train)) # (500000, 5994)
# reading validation
val = np.load('./training_10.npy')
val_2 = np.load('./training_11.npy')
val = np.vstack((val, val_2))
print('validation shape', np.shape(val)) # (100000, 5994)
# reading testing
test = np.load('./training_12.npy')
test_2 = np.load('./training_13.npy')
test = np.vstack((test, test_2))
print('testing shape', np.shape(test)) # (100000, 5994)
# n. time components
number_of_time_components = np.shape(train)[1] # 5994
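# NOTE: if the .npy files store float64, the stacked training array above is
# 500000 * 5994 * 8 bytes ~= 24 GB on its own; if float32 precision is
# acceptable, casting it (train = train.astype(np.float32)) would halve that.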
###################
### AUTOENCODER ###
###################
# network parameters
Dense_1 = 2048
Dense_2 = 1024
Dense_3 = 512
Dense_4 = 256
# training parameters
epochs = 20
filedescriptor = 'AE_run'  # tag used in the names of the saved figures
def Encoder():
    encoder_input = Input(batch_shape=(None, number_of_time_components))
    encoded = Dense(Dense_1, activation='tanh')(encoder_input)
    encoded = Dense(Dense_2, activation='tanh')(encoded)
    encoded = Dense(Dense_3, activation='tanh')(encoded)
    encoded = Dense(Dense_4, activation='tanh')(encoded)
    return Model(encoder_input, encoded)
def DecoderAE(encoder_input, encoded_input):
    decoded_3 = Dense(Dense_3, activation='tanh', name='dec_3')(encoded_input)
    decoded_2 = Dense(Dense_2, activation='tanh', name='dec_2')(decoded_3)
    decoded_1 = Dense(Dense_1, activation='tanh', name='dec_1')(decoded_2)
    decoded = Dense(number_of_time_components, activation='tanh', name='dec_out')(decoded_1)
    return Model(encoder_input, decoded)
encoder = Encoder()
AE = DecoderAE(encoder.input, encoder.output)
AE.compile(optimizer='adam', loss='mse')
# train the model
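# NOTE: a single batch of 50000 series is 50000 * 5994 * 4 bytes ~= 1.2 GB
# in float32, before any layer activations are allocated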
history = AE.fit(x=train, y=train,
                 epochs=epochs,
                 batch_size=50000,
                 validation_data=(val, val))
# plot loss
loss = history.history['loss']
val_loss = history.history['val_loss']
range_epochs = range(epochs)
plt.figure()
plt.plot(range_epochs, loss, 'bo', label='Training loss')
plt.plot(range_epochs, val_loss, 'b', label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and validation loss')
plt.legend()
plt.savefig('IMAGES/loss_{}.pdf'.format(filedescriptor))
plt.close()
# predictions for testing
prediction = AE.predict(test)
# plot testing predictions vs original
fig, ax = plt.subplots(5, figsize=(15,30))
for i in range(5):
    ax[i].plot(test[10000*i], color='blue', label='Original')
    ax[i].plot(prediction[10000*i], color='red', label='AE encoded-decoded')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(500000+10000*i+1), fontsize='x-large')
    ax[i].legend(fontsize='large')
plt.subplots_adjust(hspace=1)
fig.savefig('IMAGES/COMPRESSION_test_{}.pdf'.format(filedescriptor))
plt.close()
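Independently of gradient accumulation, I also wondered whether the MemoryError could be avoided by streaming the files with tf.data instead of stacking everything into one array with np.vstack. A sketch of what I mean, assuming the same training_*.npy files (the batch size of 256 is a placeholder, and I have not tested this on the cluster):

import numpy as np
import tensorflow as tf

def training_rows():
    # memory-map each file so that rows are read from disk lazily,
    # instead of stacking all ten files into one ~24 GB array
    for i in range(10):
        chunk = np.load('./training_{}.npy'.format(i), mmap_mode='r')
        for row in chunk:
            r = np.asarray(row, dtype=np.float32)
            yield r, r  # autoencoder: input and target are the same

dataset = (tf.data.Dataset.from_generator(
               training_rows,
               output_types=(tf.float32, tf.float32),
               output_shapes=((5994,), (5994,)))
           .batch(256)
           .prefetch(tf.data.experimental.AUTOTUNE))

# then: history = AE.fit(dataset, epochs=epochs)

Would this be preferable to gradient accumulation here, or do the two address different problems?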
EDIT
To answer the questions in the comments, here is the latest error I got, which is a segmentation fault. Note that the 23976000000-byte allocation in the warnings below is exactly 500000 * 5994 * 8 bytes, i.e. a float64 copy of the full training set:
Skipping registering GPU devices...
2020-03-17 11:59:38.147624: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-17 11:59:38.155618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300180000 Hz
2020-03-17 11:59:38.155743: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a0c00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.155761: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-03-17 11:59:38.203600: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5569424a34e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-03-17 11:59:38.203631: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-03-17 11:59:38.203714: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-17 11:59:38.203724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]
2020-03-17 11:59:39.235137: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
2020-03-17 11:59:55.898563: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 23976000000 exceeds 10% of system memory.
containerscript.sh: line 4: 3654 Segmentation fault python Autoencoder.py
srun: error: compute-gpu-0-0: task 0: Exited with exit code 139
And this is the job script used to launch the job on SLURM:
#!/bin/bash
#SBATCH -p GPU
#SBATCH -N1
#SBATCH -n1
#SBATCH --mem=100000
#SBATCH --gres=gpu:k80:1
./containerscript.sh
containerscript.sh contains only:
python Autoencoder.py