Question

I wrote a script, but it is too slow, and I'm wondering whether anyone can suggest how to speed it up. I think the slow part of the script is this:

I have a list of 1000 human gene names (each gene name is a number), read into a list called "ListOfHumanGenes"; for example, the start of the list looks like this:

[2314,2395,10672,8683,5075]

I have 100 files like this, all with the extension ".HumanHomologs":

HumanGene   OriginalGene    Intercept    age    pval 
2314       14248            5.3e-15      0.99   3.5e-33 
2395       14297            15.76       -0.05   0.59 
10672      14674            7.25         0.19   0.58 
8683       108014           21.63       -1.74   0.43 
5075       18503            -6.34        1.58   0.19

The algorithm for this part of the script is (in English, not code):

for each gene in ListOfHumanGenes:
    open each of the 100 files labelled ".HumanHomologs"
      if the gene name is present:
           NumberOfTrials +=1
           if the p-val is <0.05: 
                 if the "Age" column < 0:
                       UnderexpressedSuccess +=1
                 elif "Age" column > 0:
                       OverexpressedSuccess +=1
print each_gene + "\t" + NumberOfTrials + "\t" UnderexpressedSuccess
print each_gene + "\t" + NumberOfTrials + "\t" OverexpressedSuccess

The code for this part is:

for each_item in ListOfHumanGenes:
    OverexpressedSuccess = 0
    UnderexpressedSuccess = 0
    NumberOfTrials = 0
    for each_file in glob.glob("*.HumanHomologs"):
        open_each_file = open(each_file).readlines()[1:]
        for line in open_each_file:
            line = line.strip().split()
            if each_item == line[0]:
                NumberOfTrials +=1    #i.e if the gene is in the file, NumberOfTrials +=1. Not every gene is guaranteed to be in every file
                if line[-1] != "NA":
                    if float(line[-1]) < float(0.05):
                        if float(line[-2]) < float(0):
                            UnderexpressedSuccess +=1
                        elif float(line[-2]) > float(0):
                            OverexpressedSuccess +=1

    underexpr_output_file.write(each_item + "\t" + str(UnderexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(UnderProbability) +"\n") #Note: the "Underprobabilty" float is obtained earlier in the script
    overexpr_output_file.write(each_item + "\t" + str(OverexpressedSuccess) + "\t" + str(NumberOfTrials) + "\t" + str(OverProbability) +"\n") #Note: the "Overprobability" float is obtained earlier in the script
overexpr_output_file.close()
underexpr_output_file.close()

This produces two output files (one for over-expression, one for under-expression) that look like this; the columns are GeneName, #Overexpressed / #Underexpressed, #NumberTrials, and the last column can be ignored:

2314    8   100 0.100381689982
2395    14  90  0.100381689982
10672   10  90  0.100381689982
8683    8   98  0.100381689982
5075    5   88  0.100381689982

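Each ".HumanHomologs" file has more than 8,000 lines, and the gene list is about 20,000 genes long. So I understand why this is slow: for each of the 20,000 genes, it opens 100 files and searches for the gene among more than 8,000 genes in each file. Can anyone suggest edits that would make this script faster or more efficient?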

Answer 1

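Your algorithm opens all 100 of those files 1000 times. The optimization that comes to mind immediately is to iterate over the files in the outermost loop, which ensures that each file is opened only once. Then check for the presence of each gene and record whatever else you want to record.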

Also, the pandas module is very handy for dealing with this kind of CSV-like file. Take a look at pandas.
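
As a rough illustration of the pandas suggestion, here is a minimal sketch rather than a definitive implementation. It assumes the files are whitespace-delimited with the header shown in the question (`HumanGene`, `age`, `pval`), that gene names are handled as strings, and that missing p-values are written as `NA`. The function name `count_expression` and the `alpha` parameter are hypothetical, introduced only for this example.

```python
import glob

import pandas as pd


def count_expression(pattern, genes, alpha=0.05):
    """Return per-gene trial and success counts across all matching files."""
    paths = glob.glob(pattern)
    if not paths:
        raise FileNotFoundError("no files match " + pattern)

    # Read every file exactly once; "NA" is parsed as NaN by default.
    frames = [
        pd.read_csv(path, sep=r"\s+", dtype={"HumanGene": str})
        for path in paths
    ]
    data = pd.concat(frames, ignore_index=True)

    # Keep only the genes of interest.
    data = data[data["HumanGene"].isin(genes)]

    # Comparisons against NaN are False, so NA p-values never count.
    significant = data["pval"] < alpha
    under = significant & (data["age"] < 0)
    over = significant & (data["age"] > 0)

    key = data["HumanGene"]
    summary = pd.DataFrame({
        "NumberOfTrials": key.groupby(key).size(),
        "UnderexpressedSuccess": under.groupby(key).sum(),
        "OverexpressedSuccess": over.groupby(key).sum(),
    })
    # Genes that appear in no file get zero counts.
    return summary.reindex(genes, fill_value=0)
```

Calling `count_expression("*.HumanHomologs", ["2314", "2395"])` would return a DataFrame indexed by gene name. Whether this beats a plain loop depends on file sizes, since all files are held in memory at once.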

Answer 2

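Thanks for the help; the insight about swapping the loops was invaluable. The improved, far more efficient script is below. (Note: instead of a ListOfHumanGenes (as above), I now have a DictOfHumanGenes, where each key is a human gene and the value is a list of (1) NumberOfTrials, (2) UnderexpressedSuccess and (3) OverexpressedSuccess; this also sped up other parts of my code.)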

for each_file in glob.glob("*.HumanHomologs"):
    open_each_file = open(each_file).readlines()[1:]
    for line in open_each_file:
        line = line.strip().split()
        if line[0] in DictOfHumanGenes: 
            DictOfHumanGenes[line[0]][0] +=1  #This is the Number of trials
            if line[-1] != "NA":
                if float(line[-1]) < float(0.05):
                    if float(line[-2]) < float(0):
                        DictOfHumanGenes[line[0]][1] +=1  #This is the UnderexpressedSuccess
                    elif float(line[-2]) > float(0):
                        DictOfHumanGenes[line[0]][2] +=1  #This is the OverexpressedSuccess
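
The snippet above depends on a DictOfHumanGenes built elsewhere in the script. A self-contained sketch of the same single-pass idea might look like the following. The function name `tally_genes` and the `alpha` parameter are hypothetical, and the code assumes gene names are strings and that the last two columns are `age` and `pval`, as in the sample file.

```python
import glob


def tally_genes(pattern, gene_names, alpha=0.05):
    """Count trials and significant under/over-expression in one pass."""
    # counts[gene] = [NumberOfTrials, UnderexpressedSuccess, OverexpressedSuccess]
    counts = {gene: [0, 0, 0] for gene in gene_names}

    for path in glob.glob(pattern):
        with open(path) as handle:
            next(handle, None)  # skip the header line
            for line in handle:
                fields = line.split()
                # Dictionary lookup is O(1), so no inner loop over genes.
                if not fields or fields[0] not in counts:
                    continue
                tally = counts[fields[0]]
                tally[0] += 1
                if fields[-1] == "NA":
                    continue
                if float(fields[-1]) < alpha:
                    age = float(fields[-2])
                    if age < 0:
                        tally[1] += 1
                    elif age > 0:
                        tally[2] += 1
    return counts
```

Iterating over the file handle directly avoids loading each file into a list with `readlines()`, and the `with` block closes each file as soon as it has been read.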

I'm now looking into pandas to see how to incorporate it; if I can make the code more efficient with pandas, I'll post the answer here.

How to improve the efficiency of parsing thousands of lines across hundreds of files
