I have a biology background and am currently experimenting with and learning machine learning in order to train on a microarray dataset I have, which consists of 140 cell lines with 54871 gene expression values each. Essentially, I have 140 rows, each with 54871 columns, where each value is the expression level of one gene in that cell line; in other words, a 140 * 54871 matrix. I have labeled each of the 140 cell lines (rows) as group 1 or group 2, so that my code can learn to discriminate between them and predict which group a new 1 * 54871 input would belong to.
I have split the dataset into two parts for training and testing. My problem is: since there is one weight for each of the 54871 gene expression values, training is very slow. Over every 1000 iterations my cost function (mean squared error) only moves from 0.3057 to 0.3047, and that takes around 2-3 minutes. Also, as the number of iterations increases the cost sort of plateaus, so it looks like it would take forever for training to bring the cost down to even 0.1. It started at 0.3103, and after training overnight the MSE was only at 0.3014 when I woke up.
Is there any way I can speed up the training process? Or is there something I'm doing wrong? Thanks!
Here is my code, sorry if it's a bit messy:
import pandas as pd
import tensorflow as tf
import numpy
# load the csv data sheet of all cell lines
input_data = pd.read_csv(
    'C:/Users/lalalalalalala.csv',
    index_col=[0, 1],
    header=0,
    na_values='---')
matrix_data = input_data.as_matrix()
# user define cell lines of interest for supervised training
group1 = input(
    "Please enter the cell lines that make up your cluster of interest, separated by spaces (case sensitive): ")
group_split1 = group1.split(sep=" ")
# assign label of each: input cluster = 1
# rest of cluster = 0
# extract data of input group
# split training and test set
# the if/else expressions below handle the split when the number of cell lines in group1 is odd
split = len(group_split1)
g1_train = input_data.loc[:, group_split1[0:int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)]]
g1_test = input_data.loc[:, group_split1[(int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)):split]]
g2 = input_data.loc[:, [x for x in list(input_data) if x not in group_split1]]
split2 = g2.shape[1]
g2_train = g2.iloc[:, 0:int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)]
g2_test = g2.iloc[:, (int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)):split2]
# amplify the input data if the input data is too small:
amp1 = int((g2_train.shape[1] - split) / int(split / 2)) if g2_train.shape[1] >= split else 1  # amplify g1 if it is smaller than g2
g1_train = pd.DataFrame(numpy.tile(g1_train, (1, amp1)), index=g2_train.index)
amp2 = int((g2_test.shape[1] - split) / int(split / 2)) if g2_test.shape[1] >= split else 1
g1_test = pd.DataFrame(numpy.tile(g1_test, (1, amp2)), index=g2_test.index)
regroup_train = pd.concat([g1_train, g2_train], axis=1, join_axes=[g1_train.index])
regroup_train = numpy.transpose(regroup_train.as_matrix())
regroup_test = pd.concat([g1_test, g2_test], axis=1, join_axes=[g1_test.index])
regroup_test = numpy.transpose(regroup_test.as_matrix())
# create labels
split3 = g1_train.shape[1]
labels_train = numpy.zeros(shape=[len(regroup_train), 1])
labels_train[0:split3] = 1
split4 = g1_test.shape[1]
labels_test = numpy.zeros(shape=[len(regroup_test), 1])
labels_test[0:split4] = 1
# change all nan to 0
regroup_train = numpy.nan_to_num(regroup_train)
regroup_test = numpy.nan_to_num(regroup_test)
labels_train = numpy.nan_to_num(labels_train)
labels_test = numpy.nan_to_num(labels_test)
#######################################################################################################################
#####################################################NEURAL NETWORK####################################################
#######################################################################################################################
# define variables
trainingtimes = 1000
# create model
x = tf.placeholder(tf.float32, [None, 54871])  # 54871 = number of gene expression features
w = tf.Variable(tf.zeros([54871, 1]))
b = tf.Variable(tf.zeros([1]))
# define the model: a single sigmoid (logistic) output
y = tf.nn.sigmoid(tf.matmul(x, w) + b)
# placeholder for the true group labels
ytt = tf.placeholder(tf.float32, [None, 1])
# define the cost function (mean squared error)
mse = tf.losses.mean_squared_error(labels=ytt, predictions=y)
# train step
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.3).minimize(mse)
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(trainingtimes):
    sess.run(train_step, feed_dict={x: regroup_train, ytt: labels_train})
    if i % 100 == 0:
        print(sess.run(mse, feed_dict={x: regroup_train, ytt: labels_train}))
Answer 0 (score: 1)
There are a couple of key issues here. You are trying to define a one-layer neural network, which sounds about right for this problem. But your hidden layer is much larger than it needs to be. Try a smaller number of weights: numbers like 128, 256 or 512 (it doesn't need to be a power of 2).
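As a rough illustration only, here is a minimal sketch in the same TF1 placeholder style as the question; the 256-unit hidden size, the weight initialization and the variable names are assumptions for the example, not something prescribed here:

import tensorflow as tf

n_features = 54871  # number of gene expression columns, as described in the question
n_hidden = 256      # assumed hidden size; try 128 / 256 / 512

x = tf.placeholder(tf.float32, [None, n_features])
ytt = tf.placeholder(tf.float32, [None, 1])

# hidden layer with a few hundred units instead of one weight per gene going straight to the output
w1 = tf.Variable(tf.truncated_normal([n_features, n_hidden], stddev=0.01))
b1 = tf.Variable(tf.zeros([n_hidden]))
h = tf.nn.relu(tf.matmul(x, w1) + b1)

# output layer: a single sigmoid unit for the group-1 / group-2 decision
w2 = tf.Variable(tf.truncated_normal([n_hidden, 1], stddev=0.01))
b2 = tf.Variable(tf.zeros([1]))
y = tf.nn.sigmoid(tf.matmul(h, w2) + b2)

mse = tf.losses.mean_squared_error(labels=ytt, predictions=y)
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.3).minimize(mse)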
Also, your input dimensionality is very high. I know of someone who worked on a very similar cancer gene expression problem, with around 60,000 gene expressions and 10,000 samples. She used PCA to reduce the dimensionality of the data while preserving about 90% of the variance (she tried different values and found this to be optimal).
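A minimal sketch of that PCA step with scikit-learn, assuming the regroup_train / regroup_test arrays from the question's code; the n_components=0.9 setting (keep enough components for ~90% of the variance) is the assumed value here:

from sklearn.decomposition import PCA

# fit PCA on the training samples only (rows = cell lines, columns = genes)
pca = PCA(n_components=0.9)                     # keep ~90% of the variance
train_reduced = pca.fit_transform(regroup_train)
test_reduced = pca.transform(regroup_test)      # reuse the same projection on the test set

print(train_reduced.shape)                      # far fewer columns than 54871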
This improved her results. Neural networks can overfit, and PCA dimensionality reduction helps with that. In her experiments she also ran logistic regression and XGBoost alongside the one-layer fully connected network.
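If you want similar baselines to compare against, a rough sketch with scikit-learn and the xgboost package, building on the PCA-reduced arrays from the sketch above (the hyperparameters are arbitrary assumptions):

from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# logistic regression baseline on the reduced features
logreg = LogisticRegression()
logreg.fit(train_reduced, labels_train.ravel())
print("logistic regression accuracy:", logreg.score(test_reduced, labels_test.ravel()))

# gradient-boosted trees baseline
xgb = XGBClassifier(n_estimators=200, max_depth=3)
xgb.fit(train_reduced, labels_train.ravel())
print("xgboost accuracy:", xgb.score(test_reduced, labels_test.ravel()))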
A few other things she did for this problem that may also apply to you: