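I'm posting again because I've had no luck improving the efficiency of the following script. For more details, see my previous post, but the basic situation is as follows.
I wrote a script that computes a score and a frequency for a series of genetic profiles.
A genetic profile here is a combination of SNPs, and each SNP has two alleles. The input file for 3 SNPs therefore looks like the one below, which lists every possible combination of genotypes across all 3 SNPs. The table was generated in another script using itertools' product: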
AA CC TT
AT CC TT
TT CC TT
AA CG TT
AT CG TT
TT CG TT
AA GG TT
AT GG TT
TT GG TT
AA CC TA
AT CC TA
TT CC TA
AA CG TA
AT CG TA
TT CG TA
AA GG TA
AT GG TA
TT GG TA
AA CC AA
AT CC AA
TT CC AA
AA CG AA
AT CG AA
TT CG AA
AA GG AA
AT GG AA
TT GG AA
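For reference, a table like the one above could be produced along these lines. This is only a sketch: the function name `genotype_table` and the hard-coded genotype lists are assumptions for illustration, not the code from the other script. `itertools.product` varies its last argument fastest, so the inputs are reversed to make the first SNP vary fastest, as in the table.

```python
from itertools import product

def genotype_table(snp_genotypes):
    """Return one space-separated row per combination of genotypes.

    snp_genotypes is a list with one list of genotypes per SNP (hypothetical input).
    product() varies its last iterable fastest, so both the argument order and each
    resulting tuple are reversed to make the FIRST SNP vary fastest.
    """
    return [" ".join(reversed(combo)) for combo in product(*reversed(snp_genotypes))]

rows = genotype_table([
    ["AA", "AT", "TT"],  # SNP1
    ["CC", "CG", "GG"],  # SNP2
    ["TT", "TA", "AA"],  # SNP3
])
for row in rows[:4]:
    print(row)
```

With n SNPs this yields 3**n rows, which is why the per-row cost of the scoring script matters so much.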
I then have a second file containing a table with the weights and frequencies of the three SNPs, as follows:
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
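The columns are SNP ID, risk allele, reference allele, OR, log(OR), and population frequency. The weights apply to the risk allele.
The main script reads in these two files and computes, for each profile, a score equal to the sum of the log odds ratios over every risk allele at every SNP, and a frequency obtained by multiplying the genotype frequencies together (assuming Hardy-Weinberg equilibrium).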
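To make that arithmetic concrete, here is a minimal sketch that works from the number of risk alleles at each SNP (0, 1 or 2). The names `genotype_contribution`, `profile` and `OR_TABLE` are made up for this illustration; the log(OR) and frequency values are copied from the table above.

```python
def genotype_contribution(n_risk, log_or, p):
    """Return (score, frequency) for one SNP carrying n_risk risk alleles.

    Under Hardy-Weinberg equilibrium with risk-allele frequency p and q = 1 - p,
    the genotype frequencies are q*q (0 copies), 2*p*q (1 copy) and p*p (2 copies).
    """
    q = 1.0 - p
    hwe = {0: q * q, 1: 2.0 * p * q, 2: p * p}
    return n_risk * log_or, hwe[n_risk]

def profile(risk_counts, table):
    """Sum the scores and multiply the frequencies across all SNPs."""
    score, freq = 0.0, 1.0
    for n_risk, (log_or, p) in zip(risk_counts, table):
        s, f = genotype_contribution(n_risk, log_or, p)
        score += s
        freq *= f
    return score, freq

# (log(OR), risk-allele frequency) for SNP1..SNP3, taken from the table above
OR_TABLE = [(0.223143551314, 0.97273), (0.0676586484738, 0.3), (0.0769610411361, 0.1136)]

# Two risk alleles at SNP1, none at SNP2 and SNP3
score, freq = profile([2, 0, 0], OR_TABLE)
print(score, freq)
```

These values agree with the expected-output row further down that has score 0.446287102628 and frequency 0.364284082594.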
import sys
snp={}
riskall={}
weights={}
freqs={} # effect allele, *MAY NOT BE MINOR ALLELE
pop = int(int(sys.argv[4]) + 4) # for additional columns due to additional populations. the example table given only has one population (column 6)
# read in OR table
pos = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        snp[pos]=(line.split()[0])
        riskall[line.split()[0]]=line.split()[1]
        weights[line.split()[0]]=line.split()[4]
        freqs[line.split()[0]]=line.split()[pop]
        pos+=1
### compute scores for each combination
with open(sys.argv[2], 'r') as f:
    for line in f:
        score=0
        freq=1
        for j in range(len(line.split())):
            rsid=snp[j]
            riskallele=riskall[rsid]
            frequency=freqs[rsid]
            wei=weights[rsid]
            allele1=line.split()[j][0]
            allele2=line.split()[j][1]
            if allele2 != riskallele: # homozygous for ref
                score+=0
                freq*=(1-float(frequency))*(1-float(frequency))
            elif allele1 != riskallele and allele2 == riskallele: # heterozygous, be sure that A2 is risk allele!
                score+=float(wei)
                freq*=2*(1-float(frequency))*(float(frequency))
            elif allele1 == riskallele: # and allele2 == riskall[snp[j]]: # homozygous for risk, be sure to limit risk to second allele!
                score+=2*float(wei)
                freq*=float(frequency)*float(frequency)
            if freq < float(sys.argv[3]): # threshold to stop loop in interest of efficiency
                break
        print(','.join(line.split()) + "\t" + str(score) + "\t" + str(freq))
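I set up a variable where I can specify a threshold, so that the loop breaks once the frequency becomes extremely low. What improvements can be made to speed up the script?
I tried Pandas, but it was still much slower, and I'm not sure whether vectorization is even possible in this case. I had problems installing Dask on our Unix server. I also made sure to use only Python dictionaries rather than lists, which gave a slight improvement.
The expected output of the above looks like this: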
GG,AA,GG 0 0.000286302968304
GG,AA,GA 0.0769610411361 7.33845153414e-05
GG,AA,AA 0.153922082272 4.70243735491e-06
GG,AG,GG 0.0676586484738 0.00024540254426
GG,AG,GA 0.14461968961 6.29010131498e-05
GG,AG,AA 0.221580730746 4.03066058992e-06
GG,GG,GG 0.135317296948 5.25862594844e-05
GG,GG,GA 0.212278338084 1.34787885321e-05
GG,GG,AA 0.28923937922 8.63712983555e-07
GA,AA,GG 0.223143551314 0.0204250448374
GA,AA,GA 0.30010459245 0.00523530030129
GA,AA,AA 0.377065633586 0.000335475019306
GA,AG,GG 0.290802199788 0.0175071812892
GA,AG,GA 0.367763240924 0.00448740025824
GA,AG,AA 0.44472428206 0.000287550016548
GA,GG,GG 0.358460848262 0.00375153884769
GA,GG,GA 0.435421889398 0.000961585769624
GA,GG,AA 0.512382930534 6.16178606889e-05
AA,AA,GG 0.446287102628 0.364284082594
AA,AA,GA 0.523248143764 0.0933724543834
AA,AA,AA 0.6002091849 0.00598325294334
AA,AG,GG 0.513945751102 0.312243499367
AA,AG,GA 0.590906792238 0.0800335323286
AA,AG,AA 0.667867833374 0.00512850252286
AA,GG,GG 0.581604399576 0.0669093212928
AA,GG,GA 0.658565440712 0.0171500426418
AA,GG,AA 0.735526481848 0.00109896482633
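Edit: added the link to the previous post and the expected output.
Answer 0 (score: 1)
Disclaimer: I haven't tested this; treat it as pseudocode.
Here are some general thoughts on what makes code slow or fast, especially in Python:
You should try to move everything that doesn't change inside a loop out of the loop. Also, in Python you should try to replace loops with comprehensions: https://www.pythonforbeginners.com/basics/list-comprehensions-in-python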
[ expression for item in list if conditional ]
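As a small illustration of both points (the sample lines and variable names are made up for this sketch), compare a loop that calls `split()` again for every field with a comprehension that splits each line exactly once:

```python
lines = ["AA CC TT", "AT CG TA", "TT GG AA"]

# Loop version: line.split() is re-evaluated for every field that is read.
first_alleles_loop = []
for line in lines:
    row = []
    for j in range(len(line.split())):
        row.append(line.split()[j][0])
    first_alleles_loop.append(row)

# Comprehension version: each line is split exactly once.
first_alleles = [[genotype[0] for genotype in line.split()] for line in lines]

# With a condition, following the template above:
# keep only the genotypes whose two alleles are equal.
homozygous = [g for line in lines for g in line.split() if g[0] == g[1]]

print(first_alleles)
print(homozygous)
```

Both versions build the same result; the second simply avoids the repeated work inside the inner loop.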
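Where possible, you should also try the map/filter functions, and you can prepare the data up front so that the program has less work to do. For example,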
rsid=snp[j]
riskallele=riskall[rsid]
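is basically a double lookup. It could probably be done better if you build the snp structure like this (you can use index -1 for the last column and drop pop):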
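# f is the open OR-table file; column 5 is log(OR), the last column is the frequency
snp = [{"riskall": fields[1], "weight": float(fields[4]), "freq": float(fields[-1])}
       for fields in map(str.split, f)]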
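A self-contained way to try this out, assuming (per the column description in the question) that column 5 holds log(OR) and the last column holds the frequency. The helper name `load_snps` and the in-memory `OR_TABLE_TEXT` are made up for this sketch; in the real script the handle would be the open OR-table file.

```python
import io

# The example OR table from the question, held in memory for the sketch.
OR_TABLE_TEXT = """\
SNP1 A T 1.25 0.223143551314 0.97273
SNP2 C G 1.07 0.0676586484738 0.3
SNP3 T A 1.08 0.0769610411361 0.1136
"""

def load_snps(handle):
    """Build one dict per SNP, in file order, converting numbers to float once."""
    return [
        {"riskall": fields[1], "weight": float(fields[4]), "freq": float(fields[-1])}
        for fields in map(str.split, handle)
        if fields  # skip blank lines
    ]

snp = load_snps(io.StringIO(OR_TABLE_TEXT))
print(snp[0])
```

Because the list is indexed by column position, `snp[j]` replaces the two dictionary lookups, and the `float()` conversions happen once per SNP instead of once per genotype.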
and your computation loop could then become something like this:
### compute scores for each combination
import sys
stop = float(sys.argv[3])  # convert the threshold once, outside the loops
with open(sys.argv[2], 'r') as f:
    for fline in f:
        score = 0.0  # work with floats from the start
        freq = 1.0
        line = fline.split()  # do it only once
        for j, field in enumerate(line):
            s = snp[j]
            riskallele = s["riskall"]
            frequency = s["freq"]
            wei = s["weight"]
            allele1, allele2 = field  # unpack the two-character genotype
            if allele2 != riskallele:  # homozygous for ref
                score += 0
                freq *= (1 - frequency) * (1 - frequency)
            elif allele1 != riskallele and allele2 == riskallele:  # heterozygous, be sure that A2 is risk allele!
                score += wei
                freq *= 2 * (1 - frequency) * frequency
            elif allele1 == riskallele:  # and allele2 == riskallele: homozygous for risk, be sure to limit risk to second allele!
                score += 2 * wei
                freq *= frequency * frequency
            if freq < stop:  # threshold to stop loop in interest of efficiency
                break
        print(','.join(line) + "\t" + str(score) + "\t" + str(freq))
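The end goal I would aim for is to turn this into some kind of map/reduce form:
a genotype can be any of the 16 combinations [A,C,G,T] x [A,C,G,T], and we test it against a risk allele from [A,C,G,T], which gives 64 combinations. So I could build a table of the form
[AC,C] -> score, freq_function and get rid of the whole if block.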
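A rough sketch of that table idea, with one lookup table per SNP. The names `SNPS`, `build_table`, `TABLES` and `score_line` are made up for this illustration, and the values come from the example OR table in the question. Note one deliberate difference: this sketch counts risk alleles with `str.count`, so it does not depend on allele order, whereas the original script assumes the risk allele is written second in a heterozygote.

```python
from itertools import product

# (risk allele, log(OR), risk-allele frequency) from the example OR table
SNPS = [
    ("A", 0.223143551314, 0.97273),
    ("C", 0.0676586484738, 0.3),
    ("T", 0.0769610411361, 0.1136),
]

def build_table(risk, weight, p):
    """Map each of the 16 two-letter genotypes to its (score, frequency) pair."""
    q = 1.0 - p
    by_count = {0: (0.0, q * q), 1: (weight, 2.0 * p * q), 2: (2.0 * weight, p * p)}
    return {a + b: by_count[(a + b).count(risk)] for a, b in product("ACGT", repeat=2)}

# Built once, before any genotype line is read.
TABLES = [build_table(*snp) for snp in SNPS]

def score_line(line, tables=TABLES):
    """Score one whitespace-separated line of genotypes with plain dict lookups."""
    score, freq = 0.0, 1.0
    for genotype, table in zip(line.split(), tables):
        s, f = table[genotype]
        score += s
        freq *= f
    return score, freq

print(score_line("AA CC TT"))
```

The inner loop now contains no branching and no float conversions; the early-exit threshold from the question could still be added as a single comparison after `freq *= f`.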
Sometimes the best approach is to split the code into small functions, reorganize them, and then merge them back together.