I built a basic one-layer autoencoder to predict gene expression, where my input is a set of genes and their expression vectors. I first normalized the dataset so the expression vectors take values between -1 and 1, then trained on it.
The training loss looks very good (around 0.0015, see the log below), but when I ask the model to predict the genes, the predicted values are very different from the original ones! This is driving me crazy, because I thought a low validation loss meant the reconstructions were accurate, but apparently that is not the case, or I am doing something wrong.
I am using Keras for my model, with TensorFlow as the backend.
Here is my code:
from sklearn.preprocessing import MinMaxScaler, normalize
from keras.layers import Input, Dense, Dropout
from keras.models import Model, Sequential
import numpy as np
import csv
from keras.callbacks import CSVLogger
import tensorflow as tf
from keras import optimizers, regularizers
# read the first 1002 rows of the normalized expression matrix
x_train = []
count = 0
f = open('NormlizedGeneExpression.csv', 'r')
reader = csv.reader(f)
for row in reader:
    x_train.append(row)
    count += 1
    if count == 1002:
        break
f.close()
x_test = np.array(x_train)[0, :5000]
x_train = np.array(x_train)[1:1001, :5000]
encodedRepresentation = []
encoding_dim = 2500
number_of_probes = len(x_train[0])
print(number_of_probes)
print(x_train.shape)

# build a noisy copy of the training data, clipped back to [-1, 1]
noise_factor = 0.025
x_train = x_train.astype('float32')
x_train_noise = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_train_noise = np.clip(x_train_noise, -1., 1.)

# flatten each sample to a 1-D vector; reshape returns a new array, so assign it back
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
print(x_train.shape)
print(x_train_noise)
with tf.device('/gpu:1'):
    activation1 = 'relu'

    # probes -> probes -> encoding_dim bottleneck -> probes
    autoencoder = Sequential()
    autoencoder.add(Dense(number_of_probes, input_dim=number_of_probes,
                          activity_regularizer=regularizers.l1(10e-5), activation=activation1))
    autoencoder.add(Dense(encoding_dim, activation=activation1))
    autoencoder.add(Dense(number_of_probes, activation='sigmoid'))
    autoencoder.compile(optimizer=optimizers.Adadelta(lr=5), loss='mse')

    csv_logger = CSVLogger('epochsOptmizationNormlized.csv', append=True, separator=';')
    autoencoder.fit(x_train, x_train, callbacks=[csv_logger],
                    validation_split=0.2, epochs=500, batch_size=4)

    # compare reconstructions to the inputs, counting entries within a 0.001 tolerance
    predicted = autoencoder.predict(x_train)
    meanDiff = (predicted - x_train).mean()
    diffMatrix = np.zeros(x_train.shape)
    for column in range(len(x_train[0])):
        for row in range(len(x_train)):
            if abs(x_train[row][column] - predicted[row][column]) <= 0.001:
                diffMatrix[row][column] = 1
            else:
                diffMatrix[row][column] = 0
    print(diffMatrix.mean())
    print(meanDiff)
    print("number of matches: " + str(sum(sum(diffMatrix))) + " out of: " +
          str(len(diffMatrix) * len(diffMatrix[0])))
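The nested loops in the tolerance check can also be written as a single vectorized comparison. This is a minimal NumPy sketch of the same check, reusing the x_train and predicted arrays from above and the same 0.001 tolerance:

import numpy as np

# boolean mask: True where a prediction is within 0.001 of the input
matches = np.abs(x_train - predicted) <= 0.001

print(matches.mean())  # fraction of entries that match
print("number of matches: %d out of: %d" % (matches.sum(), matches.size))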
Running this code for only a short while already gives me a validation loss of 0.0015:
Epoch 8/500
800/800 [==============================] - 7s 9ms/step - loss: 0.0015 - val_loss: 0.0015
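To put that loss in perspective (a quick sketch, just arithmetic on the reported value): an MSE of 0.0015 corresponds to a root-mean-square error of roughly 0.039 per entry, which is far larger than the 0.001 tolerance I use in the match count above.

import math

mse = 0.0015            # the reported val_loss (MSE)
rmse = math.sqrt(mse)   # typical per-entry reconstruction error
print(rmse)             # ~0.0387, vs. the 0.001 match tolerance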
Here is a sample of the data:
array([['0.0036385251687012267', '-0.004624872352987704',
'-0.0012055354474612501', ..., '0.0', '0.0', '0.0'],
['0.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
['0.0', '0.009841553016547296', '0.007671589211117045', ...,
'-0.009578527100737567', '-0.0053043559688294994',
'0.0012712919264136816'],
...,
['0.0015377572411164127', '0.003165970790533791',
'0.004191142284611399', ..., '0.0', '0.0', '0.0'],
['0.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
['0.0', '0.0', '0.0', ..., '-0.005608291114659858',
'-0.007025439944708316', '-0.0003316731304368733']], dtype='<U23')
The data I have contains values between -10 and 10. I normalize them and then use them as input to the autoencoder, as shown in the sample above.
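The normalization itself happens before this script runs (the CSV is already normalized), so that step is not shown above. For reference, here is a minimal sketch of scaling values from [-10, 10] into the (-1, 1) range with the MinMaxScaler my script imports; the raw_expression matrix is a hypothetical stand-in for my raw data:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# hypothetical raw matrix with values in [-10, 10]; rows = samples, columns = probes
raw_expression = np.random.uniform(-10, 10, size=(1000, 5000)).astype('float32')

# scale each probe (column) into the (-1, 1) range
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(raw_expression)
print(scaled.min(), scaled.max())  # approximately -1.0 and 1.0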