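First, I tried applying the one-vs-rest technique by hand: for each run, I assigned the value 1 in the output array to the samples from the time point whose significant genes I was interested in, and 0 to all the others.
Here is the code I used: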
import numpy as np
from sklearn.linear_model import LogisticRegression
from numpy.random import seed

seed(1)
X = genes_data.iloc[:, 2:].T
log = LogisticRegression(penalty='l1', solver='liblinear', C=0.24)
for i in range(12):
    # Samples are ordered as 3 replicates x 12 time points, so sample j
    # belongs to time point j % 12; label that time point 1, the rest 0.
    y = [(1 if j % 12 == i else 0) for j in range(36)]
    model = log.fit(X, y)
    # Indices of the 10 largest (most positive) coefficients
    top_10_idx = np.argsort(model.coef_[0])[-10:]
    top_10_values = [model.coef_[0][k] for k in top_10_idx]
    top_10_genes = [genes_list["Name"][k] for k in top_10_idx]
    print("The top 10 significant genes in {}h are:".format(TIME_POINTS[i]))
    print(top_10_genes)
    # Count coefficients that the L1 penalty did not shrink to exactly zero
    print("The number of nonzero genes is {}\n".format(np.count_nonzero(model.coef_[0])))
Then I used the multi_class="ovr" parameter of LogisticRegression to do the same task:
X = genes_data.iloc[:, 2:].T
y = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12,
     15, 24, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24]
log = LogisticRegression(penalty='l1', solver='liblinear', multi_class="ovr", C=0.24)
model = log.fit(X, y)
for j in range(12):
    # Row j of model.coef_ holds the coefficients for class model.classes_[j]
    top_10_idx = np.argsort(model.coef_[j])[-10:]
    top_10_values = [model.coef_[j][k] for k in top_10_idx]
    top_10_genes = [genes_list["Name"][k] for k in top_10_idx]
    print("The top 10 significant genes in {}h are:".format(TIME_POINTS[j]))
    print(top_10_genes)
    # Count coefficients that the L1 penalty did not shrink to exactly zero
    print("The number of nonzero genes is {}\n".format(np.count_nonzero(model.coef_[j])))
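The significant genes I get at each time point generally differ between the two methods, with only a few exceptions. I don't understand why. Doesn't the second snippet perform essentially the same steps under the hood as the first?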
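To narrow this down, one option is to compare the coefficients from the two approaches directly on data where the expected result is known. The following is only a minimal sketch under several assumptions that are not part of my real setup: the expression matrix is replaced by synthetic random data (36 samples, 200 hypothetical genes, one planted gene per time point), `random_state=1` is passed to the estimator explicitly, and `OneVsRestClassifier` is swapped in as an explicit one-vs-rest wrapper in place of `multi_class="ovr"`. The helper `make_clf` is a made-up name for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real data (assumption): 36 samples x 200 genes,
# 12 time points with 3 replicates each, in the same order as in the post.
rng = np.random.default_rng(0)
time_points = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24]
y = np.array(time_points * 3)
X = rng.normal(size=(36, 200))

# Plant one informative gene per time point so the fits are not trivial.
for k, t in enumerate(time_points):
    X[y == t, k] += 3.0


def make_clf():
    # Hypothetical helper: a fresh estimator with a fixed random_state,
    # so that any solver randomness is the same in both approaches.
    return LogisticRegression(penalty="l1", solver="liblinear",
                              C=0.24, random_state=1)


# Approach 1: one independent binary fit per time point (manual one-vs-rest).
manual_coefs = np.vstack([
    make_clf().fit(X, (y == t).astype(int)).coef_[0] for t in time_points
])

# Approach 2: an explicit one-vs-rest wrapper around the same estimator.
ovr = OneVsRestClassifier(make_clf()).fit(X, y)
ovr_coefs = np.vstack([est.coef_[0] for est in ovr.estimators_])

# Largest absolute difference between the two coefficient matrices.
max_abs_diff = np.abs(manual_coefs - ovr_coefs).max()
print("Max absolute coefficient difference:", max_abs_diff)

# Per time point, how many of the top-10 genes the two approaches share.
for row, t in enumerate(time_points):
    top_manual = set(np.argsort(manual_coefs[row])[-10:])
    top_ovr = set(np.argsort(ovr_coefs[row])[-10:])
    print("{}h: {} of 10 top genes shared".format(t, len(top_manual & top_ovr)))
```

If the two coefficient matrices agree here but not on the real data, the difference more likely comes from how the labels or solver settings are set up in my two snippets than from the one-vs-rest scheme itself. Note that when many coefficients are exactly zero, the top-10 ranking among the zeros is arbitrary, so comparing coefficient values is more informative than comparing gene lists.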