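I am trying to build a variational autoencoder that learns to encode DNA sequences, but I am getting an unexpected error.
My data is a one-hot encoded array.
The problem is a ValueError: it tells me I have a four-dimensional input, when my input is clearly three-dimensional (100, 4008, 4).
In fact, when I print the seq layer, it says its shape is (?, 100, 4008, 4).
When I take out a dimension, I get an error saying the input is two-dimensional.
Any help would be greatly appreciated!
The code is: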
from keras.layers import Input
from keras.layers.convolutional import Conv1D
from keras.layers.core import Dense, Activation, Flatten, RepeatVector, Lambda
from keras import backend as K
from keras.layers.wrappers import TimeDistributed
from keras.layers.recurrent import GRU
from keras.models import Model
from keras import objectives
from one_hot import dna_sequence_to_one_hot
from random import shuffle
import numpy as np
# take FASTA file and convert into array of vectors
seqs = [line.rstrip() for line in open("/home/ubuntu/sequences.fa", "r").readlines() if line[0] != ">"]
seqs = [dna_sequence_to_one_hot(s) for s in seqs]
seqs = np.array(seqs)
# first random thousand are training, next thousand are validation
test_data = seqs[:1000]
validation_data = seqs[1000:2000]
latent_rep_size = 292
batch_size = 100
epsilon_std = 0.01
max_length = len(seqs[0])
charset_length = 4
epochs = 100
def sampling(args):
    z_mean_, z_log_var_ = args
    # batch_size = K.shape(z_mean_)[0]
    epsilon = K.random_normal_variable((batch_size, latent_rep_size), 0., epsilon_std)
    return z_mean_ + K.exp(z_log_var_ / 2) * epsilon
# loss function
def vae_loss(x, x_decoded_mean):
    x = K.flatten(x)
    x_decoded_mean = K.flatten(x_decoded_mean)
    xent_loss = max_length * objectives.categorical_crossentropy(x, x_decoded_mean)
    kl_loss = - 0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis = -1)
    return xent_loss + kl_loss
# Encoder
seq = Input(shape=(100, 4008, 4), name='one_hot_sequence')
e = Conv1D(9, 9, activation = 'relu', name='conv_1')(seq)
e = Conv1D(9, 9, activation = 'relu', name='conv_2')(e)
e = Conv1D(9, 9, activation = 'relu', name='conv_3')(e)
e = Conv1D(10, 11, activation = 'relu', name='conv_4')(e)
e = Flatten(name='flatten_1')(e)
e = Dense(435, activation = 'relu', name='dense_1')(e)
z_mean = Dense(latent_rep_size, name='z_mean', activation = 'linear')(e)
z_log_var = Dense(latent_rep_size, name='z_log_var', activation = 'linear')(e)
z = Lambda(sampling, output_shape=(latent_rep_size,), name='lambda')([z_mean, z_log_var])
encoder = Model(seq, z)
# Decoder
d = Dense(latent_rep_size, name='latent_input', activation = 'relu')(z)
d = RepeatVector(max_length, name='repeat_vector')(d)
d = GRU(501, return_sequences = True, name='gru_1')(d)
d = GRU(501, return_sequences = True, name='gru_2')(d)
d = GRU(501, return_sequences = True, name='gru_3')(d)
d = TimeDistributed(Dense(charset_length, activation='softmax'), name='decoded_mean')(d)
# create the model, compile it, and fit it
vae = Model(seq, d)
vae.compile(optimizer='Adam', loss=vae_loss, metrics=['accuracy'])
vae.fit(x=test_data, y=test_data, epochs=epochs, batch_size=batch_size, validation_data=validation_data)
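Answer 0 (score: 2)
The documentation says the input shape must be given in a specific format, (None, NumberOfFeatureVectors). In your case it would be (None, 4).
https://keras.io/layers/convolutional/
When using this layer as the first layer in a model, provide an input_shape argument (tuple of integers or None), e.g. (10, 128) for sequences of 10 vectors of 128-dimensional vectors, or (None, 128) for variable-length sequences of 128-dimensional vectors.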
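To make the rule concrete, here is a minimal plain-Python sketch of the bookkeeping: the shape you pass describes one sample, and an unknown batch axis is added in front of it. The helper `full_tensor_shape` is hypothetical and only imitates that behaviour; it is not a Keras function.

```python
def full_tensor_shape(shape):
    """Hypothetical helper: prepend an unknown batch axis (None) to a per-sample shape."""
    return (None,) + tuple(shape)

# The shape from the question already contains the batch size (100),
# so the resulting tensor has 4 axes, one more than Conv1D accepts.
question_shape = full_tensor_shape((100, 4008, 4))
print(question_shape, len(question_shape))

# Per-sample shape only: (steps, channels) yields the 3 axes Conv1D expects.
fixed_shape = full_tensor_shape((4008, 4))
print(fixed_shape, len(fixed_shape))
```

This matches the (?, 100, 4008, 4) shape printed in the question, where ? is the batch axis.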
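Answer 1 (score: 2)
Specify kernel_size in the convolutional layers as a tuple rather than an integer, even though it needs only one dimension:
e = Conv1D(9, (9,), activation = 'relu', name='conv_1')(seq)
Although the Keras documentation states that both integers and tuples are valid, I have found tuples more convenient when dealing with dimensionality.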
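A note on the syntax: in Python, parentheses alone do not create a tuple; the trailing comma does. The short check below uses two illustrative variable names (`kernel_a`, `kernel_b`) that are not part of the original code.

```python
kernel_a = (9)    # parentheses only: still the integer 9
kernel_b = (9,)   # trailing comma: a one-element tuple

print(type(kernel_a).__name__)
print(type(kernel_b).__name__)
```

So a kernel size written with parentheses but no comma is passed to the layer as a plain integer.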
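Answer 2 (score: 0)
Try giving the network its input like this:
Input(shape=(None, 4))
Generally this form is used when you do not know the length of your sequences, but I ran into the same problem and, for some reason, it was solved when I did this.
Hope it works!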
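One possible caveat, sketched under the assumption that the four convolutions in the question keep Keras's default stride of 1 and 'valid' padding: with a length of None, the number of steps after the convolutions also stays unknown, and as far as I know the Flatten and Dense layers that follow need a known size. The helper names below are hypothetical and only do the arithmetic.

```python
def conv1d_valid_length(length, kernel_size):
    """Output steps of a stride-1, 'valid'-padded 1D convolution; None stays None."""
    if length is None:
        return None
    return length - kernel_size + 1

KERNEL_SIZES = [9, 9, 9, 11]  # conv_1 .. conv_4 in the question

def encoder_steps(length):
    """Steps remaining after all four convolutions."""
    for kernel_size in KERNEL_SIZES:
        length = conv1d_valid_length(length, kernel_size)
    return length

print(encoder_steps(4008))  # fixed-length input: a concrete number of steps
print(encoder_steps(None))  # variable-length input: still unknown
```

If every sequence has the same length, as max_length in the question suggests, (4008, 4) may therefore be the safer shape to pass.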
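Answer 3 (score: 0)
I solved this problem recently. The error is reported because the shape you passed has one dimension too many for Conv1D, which expects only (steps, channels) for each sample.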
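A quick way to see which part of the data shape belongs in Input is to look at a small batch. The sketch below uses plain lists and a hypothetical encoder, `one_hot_dna`, as a stand-in for the question's `dna_sequence_to_one_hot`, whose implementation is not shown; the three short sequences are made up for illustration.

```python
ALPHABET = "ACGT"

def one_hot_dna(sequence):
    """Hypothetical encoder: one row per base, one column per letter in ALPHABET."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in sequence]

def nested_shape(batch):
    """Shape of a rectangular nested list as (samples, steps, channels)."""
    return (len(batch), len(batch[0]), len(batch[0][0]))

batch = [one_hot_dna(s) for s in ["ACGT", "TTGA", "CCAG"]]
data_shape = nested_shape(batch)

print(data_shape)      # (samples, steps, channels)
print(data_shape[1:])  # per-sample shape, the part that belongs in Input(shape=...)
```

Applied to the question's data, dropping the first axis of (100, 4008, 4) leaves (4008, 4).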