Question

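For my research, I need to process large text files (10 GB) of biological sequences in FASTA format. More precisely, I have to extract the sequences that have particular ids into a separate file. A FASTA sequence looks like this:

>id|id_number (e.g. 102574)|stuff

ATGCGAT....ATGTC.. (multiple lines)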
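To make the header layout concrete, here is a minimal Python 3 sketch that pulls the id_number field out of a header line. It assumes every header has the form `>id|id_number|stuff`. The name `header_id` is hypothetical and does not come from the original script, which also strips some extra prefixes that are not reproduced here.

```python
def header_id(line):
    """Return the second '|'-separated field of a FASTA header, or None.

    Assumes headers look like '>id|id_number|stuff' (hypothetical helper).
    """
    if not line.startswith(">"):
        return None  # a sequence line, not a header
    fields = line.rstrip("\n").split("|")
    if len(fields) < 2:
        return None  # header without an id_number field
    return fields[1].strip()
```

Sequence lines and headers without a second field both yield `None`, so the caller can treat them the same way.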

So I wrote a script that searches chunks of these big files, in order to parallelize the search (and use my 8 CPUs) with Python's multiprocessing library.
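One possible way to cut such a file into chunks for the workers is to compute byte ranges that always begin on a header line, so that no record is split between two processes. The sketch below is written for Python 3 and is only an illustration: `chunk_offsets` is a hypothetical helper, and it assumes the file starts with a `>` header.

```python
import os

def chunk_offsets(path, n_chunks):
    """Split a FASTA file into up to n_chunks (start, end) byte ranges.

    Every range starts at a line beginning with '>', so no record is cut
    in half. Hypothetical helper, not the original chunking code.
    """
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for k in range(1, n_chunks):
            f.seek(size * k // n_chunks)
            f.readline()  # skip the (possibly partial) current line
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:  # reached end of file, no header found
                    pos = size
                    break
                if line.startswith(b">"):
                    break
            # keep only boundaries that are new and inside the file
            if starts[-1] < pos < size:
                starts.append(pos)
    ends = starts[1:] + [size]
    return list(zip(starts, ends))
```

Each worker can then open the file itself, seek to its `start`, and read `end - start` bytes, so only two integers are passed between processes instead of the chunk text.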

The function I pass to my multiprocessing class is as follows:

    idlist=inP[0]   # list of good ids
    filpath=inP[1]  # chunk of the big file
    idproc=inP[2]   # id of the process

    #######################
    fil=filpath.split('\n')
    del filpath
    f=open('seqwithid{0}'.format(idproc),'w')
    def lineiter():
        for line in fil:
            yield line
    it=lineiter()
    line=it.next()

    while 1:
        try:
            ids=line.split('|')[1].split('locus')[0].partition('ref')[0]
            #print ids
            while ids[0].isalpha():
                ids=ids[1:]
        except Exception:
            pass
        else:
            if ids in idlist: 
                f.write(line+'\n')
                while 1:
                    try:
                        line=it.next()
                    except Exception:
                        break
                    if line and line[0]!='>':
                        f.write(line+'\n')
                    else:
                        break
        try:                
            line=it.next()
        except Exception:
            break
        while  not line or line[0]!='>':
            try:
                line=it.next()
            except Exception:
                break
    f.close()

To improve speed, I rewrote this code as a C function.

I cut the file into chunks:

f1=fopen(adr, "r");
if (f1==0){printf("wrong sequences file: %s\n",adr);exit(1);}

fstream = (char *) malloc((end-begin)*sizeof(char) );
fseek(f1,begin,SEEK_CUR);
fread(fstream,sizeof(char)*(end-begin-1),1,f1);
adrtampon=fgetc(f1);

while (!(feof(f1)) && adrtampon!=ter)
{
    sprintf(fstream,"%s%c",fstream,adrtampon);
    adrtampon=fgetc(f1);
}
fclose(f1);

The main function walks through the chunk until it finds a '>' character:

adrtampon=fstream[0];   
i=0;

while(adrtampon!='\0' )
{
    adrtampon=fstream[i];
    if (adrtampon==ter)
    {
        sprintf(id,"%s",seekid((fstream+i)));

        if (checkidlist(id,tab,size)==0) 
        {
            i++;
            fputc('>',f2);
            adrtampon=fstream[i];
            while (adrtampon!='\0' &&  adrtampon!=ter)  
            {
                fputc(adrtampon,f2);
                i++;
                adrtampon=fstream[i];
            }
            i--;
        }
    }
    i++;
}

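When I find a '>', I first extract the id of the sequence between the two '|' characters, then check it against my library of interesting ids with another simple function that loops over it (similar to `if id in idlist`). This C function is then called from a Python function that still uses the multiprocessing class.

In the end, even with a single process, I get worse performance with the C code than with the Python code. (When I process the file directly instead of processing chunks, I get better performance with C, but only with one process, because of concurrent access to the file by multiple processes, I think.)

Any suggestions for improving my C code, and an explanation of why it is slower than the equivalent Python code? Thanks a lot! (Especially if you have read this far!)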

Answer 1

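A 10 GB file probably will not fit in memory (if it did fit, you could do it as I did here). So the only way to read and process it is: read a part, process that part, read the next part. If the line length is bounded, fgets() is the most elegant option. Otherwise, you can read one character at a time and process it with a small state machine. Reading buffer-sized blocks is possible but harder, because logical lines will span buffer boundaries.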
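The read-a-part, process-it, read-the-next approach can be sketched in Python 3, the language of the original script, where iterating over a file object plays the role of a fgets() loop. This is a rough illustration under stated assumptions, not the asker's code: `filter_fasta` is a hypothetical name, headers are assumed to look like `>id|id_number|stuff`, and the single `keep` flag is the small state machine.

```python
def filter_fasta(in_path, out_path, wanted_ids):
    """Copy records whose id_number is in wanted_ids, one line at a time.

    Assumes headers look like '>id|id_number|stuff'. Returns the number
    of records written. Hypothetical helper for illustration.
    """
    wanted = set(wanted_ids)  # set membership is O(1) on average
    keep = False              # state: are we inside a wanted record?
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:      # the file object buffers reads internally
            if line.startswith(">"):
                fields = line.split("|")
                keep = len(fields) > 1 and fields[1].strip() in wanted
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written
```

Only one line is held in memory at a time, so the size of the input file does not matter, and the output is written through a buffered file object rather than one character per call.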

Python search script faster than the C equivalent
