使用Python迭代两个大列表

时间:2015-03-24 05:33:15

标签: python

我有两个文件都是制表符分隔的。其中一个文件是近800k行,它是一个An Exonic Coordinates文件,另一个文件几乎是200k行(它是一个VCF文件)。

我正在python中编写一个代码来查找和过滤VCF中位于外部坐标内的位置(外显子坐标文件中的外显子开始和结束)并将其写入文件。

但是,由于文件很大,需要花费几天时间才能获得过滤后的输出文件?

所以下面的代码部分解决了速度的问题,但问题是弄清楚是为了加快过滤过程,这就是为什么我用了一个休息来退出第二个循环而我想从外面开始循环而不是从第一个循环中获取下一个元素(外部循环)?

这是我的代码:

import
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
                 list_coord.append((row[0],row[1],row[2]))

    def parseVcf(vcf,src):
        done = False
        with open(vcf,'r') as f:
                    reader=csv.reader((f),delimiter='\t')
                    vcf_out_split = vcf.split('.')
                    vcf_out_split.insert(2,"output_CORRECT2")
                    outpt = open('.'.join(vcf_out_split),'a')
                    for coord in list_coord:
                            for row in reader:
                                   if '#' not in row[0]:
                                            coor_genom = int(row[1])
                                            coor_exon1 = int(coord[1])+1
                                            coor_exon2 = int(coord[2])
                                            coor_genom_chr = row[0]
                                            coor_exon_chr = coord[0]
                                            ComH = row[7].split(';')
                                            for x in ComH:
                                               if 'DP4=' in x:
                                                 DP4_split=x[4:].split(',')
                                                 if (coor_exon1 <= coor_genom <= coor_exon2):
                                                    if (coor_genom_chr == coor_exon_chr):
                                                       if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
                                                         done = True

                                                         outpt.write('\t'.join(row) + '\n')

                                            if done:
                                                    break
                    outpt.close()
for root,dirs,files in os.walk("."):
    for file in files:
      pathname=os.path.join(root,file)
      if file.find("1_1")==0:
        print "Parsing " + file
        parseVcf(pathname, "1_1")

ref_ordered.txt:

1   69090   70008
1   367658  368597
1   621095  622034
1   861321  861393
1   865534  865716
1   866418  866469
1   871151  871276
1   874419  874509

1_1输入文件:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT     directory
1   14907   rs79585140  A   G   20  .   DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ    0/1:50,0,51:50
1   14930   rs75454623  A   G   44  .   DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38  GT:PL:GQ    0/1:74,0,58:61
1   15211   rs78601809  T   G   9.33    .   DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171    GT:PL:GQ    1/1:41,10,0:13
1   16146   .   A   C   25  .   DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ    0/1:55,0,68:58
1   16257   rs78588380  G   C   40  .   DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308   GT:PL:GQ    0/1:70,0,249:73
1   16378   rs148220436 T   C   39  .   DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227    GT:PL:GQ    0/1:69,0,90:72

输出文件:

1   877831  rs6672356   T   C   44.8    .   DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ    1/1:76,6,0:10
1   878000  .   C   T   44.8    .   DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA    GT:PL:GQ    1/1:76,6,0:10
1   881627  rs2272757   G   A   205 .   DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1   GT:PL:GQ    1/1:238,27,0:51

2 个答案:

答案 0 :(得分:1)

首先,我没有包含任何代码,因为它看起来像我的家庭作业(我有这样的功课)。然而,我将尝试解释我为改进脚本所采取的步骤,即使我知道我的解决方案远非完美。

您的脚本可能会很慢,因为您的csv文件中的每一行都会打开,写入并关闭输出文件。尝试列出要添加到输出文件的行,在完成阅读和过滤后,再开始编写。

您还可能需要考虑为每个过滤器编写函数,并使用行作为变量调用这些函数。这样,您可以在以后轻松添加过滤器。我使用计数器来跟踪成功过滤器的数量,如果最后counter == len(amountOfUsedFilers)我将我的行添加到列表中。

此外,为什么使用outpt = open('.'.join(vcf_out_split),'a')with open(vcf,'r') as f:尝试在您的选择中保持一致和聪明。

获胜的生物信息学!

答案 1 :(得分:0)

如果您的两个文件都是有序的,您可以通过并行迭代来节省大量时间,始终推进具有最低坐标的文件。这样,您只需处理一次,而不是多次。

以下是您的代码的基本版本,它只进行坐标检查(我不完全理解您的DP4条件,所以我会留给您将该部分添加回来):

with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn) as out_f:
    coords = csv.reader(coords_f, delimiter="\t")
    vcf = csv.reader(vcf_f, delimiter="\t")
    out = csv.writer(out_f, delimiter="\t")

    next(vcf) # discard header row, or use out.writeline(next(vcf)) to preserve it!

    try:
        c = next(coords)
        r = next(vcf)

        while True:
            if int(c[1]) >= int(r[1]):   # vcf file is behind
                r = next(vcf)
            elif int(c[2]) < int(r[1]):  # coords file is behind
                c = next(coords)
            else:                        # int(c[1]) < int(r[1]) <= int(c[2])
                out.writeline(r)  # add DP4 check here, and indent this line under it

                r = next(vcf)     # don't indent this line
    except StopIteration: # one of the files has ended
        pass