Take data from a huge text file to efficiently replace data in another huge text file (Python)

Asked: 2014-11-18 12:25:57

Tags: python performance text replace large-files

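I have been programming for only a few months, so I'm not an expert. I have two huge text files (omni, ~20 GB, ~2.5M lines; dbSNP, ~10 GB, ~60M lines). Each has a few initial lines, not necessarily tab-delimited, starting with "#" (the header); the remaining lines are organized in tab-delimited columns (the actual data).

The first two columns of each line contain the chromosome number and the position on the chromosome, while the third column contains an identification code. In the "omni" file I don't have the IDs, so I need to find each position in the dbSNP file (a database) and create a copy of the first file completed with the IDs.

Because of memory limitations, I decided to read the two files line by line and to restart from the last line read. I'm not satisfied with the efficiency of my code, because I feel it is slower than it could be. Given my lack of experience, I'm pretty sure that's my fault. Is there a way to make it faster using Python? Could the problem be the opening and closing of the files?

I usually launch the script in GNOME Terminal (Python 2.7.6, Ubuntu 14.04) like this: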

  

python -u Replace_ID.py > Replace.log 2> Replace.err

Thank you very much.

omni (Omni example):

  

...
  #CHROM POS ID REF ALT ...
  1 534247 . C T ...
  ...

dbSNP (dbSNP example):

  

...
  #CHROM POS ID REF ALT ...
  1 10019 rs376643643 TA T ...
  ...

The output should be exactly the same as the Omni file, but with the rs ID after the position.

Code:

SNPline = 0    #line in dbSNP file
SNPline2 = 0    #temporary copy
omniline = 0    #line in omni file
line_offset = []    #beginnings of every line in dbSNP file (stackoverflow.com/a/620492)
offset = 0
with open("dbSNP_build_141.vcf") as dbSNP: #database
    for line in dbSNP:
        line_offset.append(offset)
        offset += len(line)
    dbSNP.seek(0)
with open("Omni_replaced.vcf", "w") as outfile:     
    outfile.write("")       
with open("Omni25_genotypes_2141_samples.b37.v2.vcf") as omni:  
    for line in omni:           
        omniline += 1
        print str(omniline) #log
        if line[0] == "#":      #if line is header
            with open("Omni_replaced.vcf", "a") as outfile:
                outfile.write(line) #write as it is
        else:
            split_omni = line.split('\t') #tab-delimited columns
            with open("dbSNP_build_141.vcf") as dbSNP:
                SNPline2 = SNPline          #restart from last line found
                dbSNP.seek(line_offset[SNPline])    
                for line in dbSNP:
                    SNPline2 = SNPline2 + 1 
                    split_dbSNP = line.split('\t')  
                    if line[0] == "#":
                        print str(omniline) + "#" + str(SNPline2) #keep track of what's happening.
                        rs_found = 0    #it does not contain the rs ID
                    else:
                        if split_omni[0] + split_omni[1] == split_dbSNP[0] + split_dbSNP[1]:    #if chromosome and position match
                            print str(omniline) + "." + str(SNPline2) #log
                            SNPline = SNPline2 - 1
                            with open("Omni_replaced.vcf", "a") as outfile:
                                split_omni[2] = split_dbSNP[2]  #replace the ID
                                outfile.write("\t".join(split_omni)) 
                            rs_found = 1    #ID found
                            break        
                        else:
                            rs_found = 0    #ID not found
                if rs_found == 0:   #if ID was not found in dbSNP, then:
                    with open("Omni_replaced.vcf", "a") as outfile:
                        outfile.write("\t".join(split_omni)) #keep the line unedited
                else:   #if ID was found:
                    pass    #no need to do anything, line already written
    print "End."

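1 Answer:

Answer 0 (score: 0)

Here is my contribution to your problem. First, here is my understanding of it, just to check that I've got it right: you have two files, each of them a tab-separated values file. The first one, dbSNP, contains the data; its third column is the identifier that corresponds to the chromosome number (column 1) and the position on that chromosome (column 2).

The task is to take the omni file and fill its ID column with the values from the dbSNP file (matched on chromosome number and position).

The difficulty comes from the size of the files. You tried to store the file offset of every line so that you could seek directly to the right line and avoid loading the whole dbSNP file into memory. Since that approach is not fast enough for you because the files are opened many times, here is what I suggest.

Parse the dbSNP file once and keep only the essential information, i.e. the pairs (number, position): ID. From your example, that gives: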

1   534247  rs201475892
1   569624  rs6594035
1   689186  rs374789455

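These fields make up less than 10% of each line, so from the ~10 GB dbSNP file you would keep less than about 1 GB of raw text. Python's per-object overhead will push the real memory footprint higher, but it may still be affordable (I don't know what kind of loading you tried before).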
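Since the size of the raw text and the size of a Python dictionary holding it are not the same thing, it may be worth measuring before committing to this approach. The following is only a rough sketch, written for Python 3: `estimate_index_bytes` is a hypothetical helper (not part of the script in this answer) that builds the index for a small sample of lines and extrapolates to the full row count. Strings shared between keys are counted more than once and a small dictionary has a different overhead per entry than a large one, so treat the result as an order of magnitude only.

```python
import sys

def estimate_index_bytes(sample_lines, total_rows):
    """Extrapolate the memory used by a {(chrom, pos): id} dict from a sample."""
    index = {}
    for line in sample_lines:
        if line.startswith("#"):
            continue  # skip header lines
        chrom, pos, rs_id = line.rstrip("\n").split("\t", 3)[:3]
        index[(chrom, pos)] = rs_id
    if not index:
        return 0
    # size of every key tuple, of the strings inside it, and of every value
    payload = sum(
        sys.getsizeof(key) + sum(sys.getsizeof(part) for part in key) + sys.getsizeof(value)
        for key, value in index.items()
    )
    per_entry = (sys.getsizeof(index) + payload) / float(len(index))
    return int(per_entry * total_rows)

sample = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t10019\trs376643643\tTA\tT\n",
    "1\t534247\trs201475892\tC\tT\n",
]
estimated_bytes = estimate_index_bytes(sample, 60000000)
print(estimated_bytes)
```

Feeding it the first few thousand data lines of the real dbSNP file, rather than the two-line sample above, should give a more representative figure.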

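Here is my code for building that index and filling in the IDs. Don't hesitate to ask for explanations; unlike you, I use object-oriented programming.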

import argparse

#description of this script
__description__ = "This script parses a database file to collect the identifiers and adds them to another file.\nTakes the IDs from databaseFile and outputs the targetFile content enriched with IDs"

# -*- -*-  -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- 
#classes used to handle and operate on data
class ChromosomeIdentifierIndex():
    def __init__(self):
        self.chromosomes = {}

    def register(self, chromosomeNumber, positionOnChromosome, identifier):
        if chromosomeNumber not in self.chromosomes:
            self.chromosomes[chromosomeNumber] = {}

        self.chromosomes[chromosomeNumber][positionOnChromosome] = identifier

    def __setitem__(self, ref, ID):
        """ Allows to use alternative syntax to chrsIndex.register(number, position, id) : chrsIndex[number, position] = id """
        chromosomeNumber, positionOnChromosome = ref[0],ref[1]
        self.register(chromosomeNumber, positionOnChromosome, ID)

    def __getitem__(self, ref):
        """ Allows to get IDs using the syntax: chromosomeIDindex[chromosomenumber,positionOnChromosome] """
        chromosomeNumber, positionOnChromosome = ref[0],ref[1]
        try:
            return self.chromosomes[chromosomeNumber][positionOnChromosome]
        except KeyError:  # unknown chromosome or position: keep the "missing ID" marker
            return "."

    def __repr__(self):
        # __repr__ must return a string instead of printing it
        lines = []
        for chrs in self.chromosomes:
            lines.append("Chromosome : %s" % chrs)
            for position in self.chromosomes[chrs]:
                lines.append("\t%s\t%s" % (position, self.chromosomes[chrs][position]))
        return "\n".join(lines)

class Chromosome():
    def __init__(self, string):
        self.values   = string.split("\t")
        self.chrs     = self.values[0]
        self.position = self.values[1]
        self.ID       = self.values[2]

    def __str__(self):
        return "\t".join(self.values)

    def setID(self, ID):
        self.ID = ID
        self.values[2] = ID

class DefaultWritter():
    """ Used to print on screen if no output file is specified """
    def __init__(self):
        pass
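    def write(self, string):
        # lines read from the files already end with "\n", so avoid print's extra newline
        import sys  # imported here to keep this class self-contained
        sys.stdout.write(string)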
    def close(self):
        pass

# -*- -*-  -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- 
#The code executed when the script is called
if __name__ == "__main__":

    #initialisation
    parser = argparse.ArgumentParser(description = __description__)
    parser.add_argument("databaseFile"  , help="A database file that contains a lot of information, including the IDs.")
    parser.add_argument("targetFile"    , help="A file that contains information but lacks the IDs.")
    parser.add_argument("-o", "--output", help="The output file of the script. If no output is specified, the output will be printed on the screen.")
    parser.add_argument("-l", "--logs"  , help="The log file of the script. If no log file is specified, the logs will be printed on the screen.")
    args = parser.parse_args()

    output = None
    if args.output is None:
        output = DefaultWritter()
    else:
        output = open(args.output, 'w')

    logger = None
    if args.logs is None:
        logger = DefaultWritter()
    else:
        logger = open(args.logs, 'w')

    #start of the process

    idIndex = ChromosomeIdentifierIndex()

    #build index by reading the database file.
    with open(args.databaseFile, 'r') as database:
        for line in database:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                idIndex[chromosome.chrs, chromosome.position] = chromosome.ID

    #read the target, replace the ID and output the result
    with open(args.targetFile, 'r') as target:
        for line in target:
            if not line.startswith("#"):
                chromosome = Chromosome(line)
                chromosome.setID(idIndex[chromosome.chrs, chromosome.position])
                output.write(str(chromosome))
            else:
                output.write(line)

    output.close()
    logger.close()          

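The main idea is to parse the dbSNP file once and collect all the IDs in a dictionary, then to read the omni file line by line and output the result.

You can call the script like this: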
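For readers who prefer plain functions to classes, the same idea can be sketched more compactly. This is a minimal illustration written for Python 3, not a replacement for the script above: the function names `build_id_index` and `fill_ids` are made up for this example, and `io.StringIO` objects stand in for the real files.

```python
import io

def build_id_index(db_handle):
    """Read the database once and map (chromosome, position) to its ID."""
    index = {}
    for line in db_handle:
        if line.startswith("#"):
            continue  # header lines carry no data
        chrom, pos, rs_id = line.rstrip("\n").split("\t", 3)[:3]
        index[(chrom, pos)] = rs_id
    return index

def fill_ids(omni_handle, index, out_handle):
    """Copy the target file, replacing column 3 whenever the index knows the position."""
    for line in omni_handle:
        if line.startswith("#"):
            out_handle.write(line)  # headers are copied unchanged
            continue
        # split only the first three columns; the long tail of the line stays in one piece
        fields = line.rstrip("\n").split("\t", 3)
        fields[2] = index.get((fields[0], fields[1]), fields[2])
        out_handle.write("\t".join(fields) + "\n")

demo_db = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t10019\trs376643643\tTA\tT\n"
    "1\t534247\trs201475892\tC\tT\n"
)
demo_omni = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t534247\t.\tC\tT\n"
    "1\t999\t.\tA\tG\n"
)
demo_out = io.StringIO()
fill_ids(demo_omni, build_id_index(demo_db), demo_out)
print(demo_out.getvalue())
```

Splitting with a maximum of three splits avoids cutting every genotype column of the very long omni lines when only the first three fields are needed.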

python replace.py ./dbSNP_example.vcf ./Omni_example.vcf -o output.vcf

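I used the argparse module to handle the arguments; it also generates help automatically, so the description of the arguments is available with either of the following: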

python replace.py -h

python replace.py --help

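I believe this approach will be faster than yours, because each file is read only once and all lookups then happen in RAM. I invite you to test it.

Note: I don't know whether you are familiar with object-oriented programming, so I should mention that all the classes are in the same file here only so that they can be posted on Stack Overflow. In a real-life use case, good practice would be to put each class in a separate file, e.g. "Chromosome.py", "ChromosomeIdentifierIndex.py" and "DefaultWritter.py", and then import them into the "replace.py" file.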
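If the full index turns out not to fit in RAM, one possible variation is to hold only one chromosome's IDs at a time. The sketch below, written for Python 3, is a different technique from the script above and rests on assumptions that are not confirmed in the question: the rows of each chromosome are contiguous in both files, and the chromosomes appear in the same order in both. The names `data_rows` and `fill_ids_by_chromosome` are hypothetical. If the omni file contains a chromosome that dbSNP lacks, the database iterator is exhausted and later chromosomes are left unannotated, so real data would need a check for that case.

```python
import io
from itertools import groupby

def data_rows(handle):
    """Yield [chrom, pos, id] for every non-header line."""
    for line in handle:
        if not line.startswith("#"):
            yield line.rstrip("\n").split("\t", 3)[:3]

def fill_ids_by_chromosome(db_handle, omni_handle, out_handle):
    # groupby yields one (chromosome, rows) block per run of equal chromosomes
    db_blocks = groupby(data_rows(db_handle), key=lambda row: row[0])
    current_chrom, ids = None, {}
    for line in omni_handle:
        if line.startswith("#"):
            out_handle.write(line)
            continue
        fields = line.rstrip("\n").split("\t", 3)
        chrom, pos = fields[0], fields[1]
        if chrom != current_chrom:
            # new chromosome: drop the old index and read forward to the matching block
            current_chrom, ids = chrom, {}
            for db_chrom, rows in db_blocks:
                if db_chrom == chrom:
                    ids = {row[1]: row[2] for row in rows}
                    break
        fields[2] = ids.get(pos, fields[2])
        out_handle.write("\t".join(fields) + "\n")

db = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\trs1\tA\tT\n"
    "1\t200\trs2\tC\tG\n"
    "2\t100\trs3\tG\tA\n"
)
omni = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t200\t.\tC\tG\n"
    "1\t300\t.\tA\tC\n"
    "2\t100\t.\tG\tA\n"
)
result = io.StringIO()
fill_ids_by_chromosome(db, omni, result)
print(result.getvalue())
```

Both files are still read only once, but peak memory is bounded by the largest single chromosome rather than by the whole database. The rs IDs in the demo data are placeholders.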