Question

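I have a dataset containing gene expression values measured at 12 different time points over 24 hours for 3 samples of the same bacterium. I am trying to find the most important genes at each time point, based on their expression values, using logistic regression in Python.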

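First, I applied the one-vs-rest technique manually: in the output array I assigned 1 to the samples from the time point whose important genes I am interested in, and 0 to all the others.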

Here is the code I used:

import numpy as np
from sklearn.linear_model import LogisticRegression
from numpy.random import seed
seed(1)

# Rows are samples (36 = 3 replicates x 12 time points), columns are genes.
X = genes_data.iloc[:, 2:].T
log = LogisticRegression(penalty='l1', solver='liblinear', C=0.24)
for i in range(12):
    # One-vs-rest target: 1 for the samples of time point i, 0 for the rest.
    y = [(1 if j % 12 == i else 0) for j in range(36)]
    model = log.fit(X, y)

    # Indices of the 10 largest coefficients, in ascending order.
    top_10_idx = np.argsort(model.coef_[0])[-10:]
    top_10_values = [model.coef_[0][k] for k in top_10_idx]
    top_10_genes = [genes_list["Name"][k] for k in top_10_idx]

    print("The top 10 significant genes in {}h are:".format(TIME_POINTS[i]))
    print(top_10_genes)
    # Count the coefficients that the L1 penalty left nonzero.
    print("The number of nonzero genes is {}\n".format(np.count_nonzero(model.coef_[0])))

Then I used the multi_class="ovr" parameter of LogisticRegression to accomplish the same task:

X = genes_data.iloc[:, 2:].T

# Time point (in hours) of each sample: the 12 time points repeated for 3 replicates.
y = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12,
     15, 24, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24]

log = LogisticRegression(penalty='l1', solver='liblinear', multi_class="ovr", C=0.24)
model = log.fit(X, y)

for j in range(12):
    # Row j of coef_ corresponds to the j-th entry of model.classes_ (the sorted labels).
    top_10_idx = np.argsort(model.coef_[j])[-10:]
    top_10_values = [model.coef_[j][k] for k in top_10_idx]
    top_10_genes = [genes_list["Name"][k] for k in top_10_idx]

    print("The top 10 significant genes in {}h are:".format(TIME_POINTS[j]))
    print(top_10_genes)
    # Count the coefficients that the L1 penalty left nonzero.
    print("The number of nonzero genes is {}\n".format(np.count_nonzero(model.coef_[j])))

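The important genes I get at each time point are generally different between the two methods, with only a few exceptions. I don't understand why this happens. Does the second snippet perform essentially the same steps as the first one under the hood?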
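
One part of this can be checked without fitting any model: whether the two snippets hand each binary problem the same targets. The following is only a minimal sketch, assuming the 36 samples are ordered as three consecutive replicates of the 12 time points (as the y list suggests) and that the rows of model.coef_ follow the sorted labels. The helper names manual_labels and ovr_labels are hypothetical and used only for illustration.

```python
# Time-point labels as passed to the multi_class="ovr" fit (copied from the question).
y_multi = [2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 15, 24] * 3

# Sorted unique labels; assumed to match the row order of model.coef_.
classes = sorted(set(y_multi))


def manual_labels(i, n_samples=36, n_classes=12):
    """Binary targets built by hand in the first snippet for time-point index i."""
    return [1 if j % n_classes == i else 0 for j in range(n_samples)]


def ovr_labels(cls, y):
    """Binary targets that a one-vs-rest split derives for the class cls."""
    return [1 if label == cls else 0 for label in y]


# Compare the two encodings for every time point.
labels_match = all(
    manual_labels(i) == ovr_labels(cls, y_multi)
    for i, cls in enumerate(classes)
)
print(labels_match)
```

If this prints True, both approaches see identical targets for every time point, so any difference in the selected genes would have to come from the fitting step itself rather than from how the labels are encoded.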

Difference in results between implementations of logistic regression for multi-class classification

0 answers: