Looping over a huge text file: reading the chunks between two recurring patterns with Python

Posted: 2018-12-15 16:55:31

Tags: python file-io bioinformatics

I need to loop over a huge text file of the biological sequence database GeneGene (too big for memory: 20 GB) and extract the same pieces of information for every database entry. Each entry begins with a line LOCUS XYZ some more text and ends with a line //. For example:

LOCUS 123 some more text
many lines of some more text
many lines of some more text
//
LOCUS 231 some more text
many lines of some more text
many lines of some more text
//
LOCUS 312 some more text
many lines of some more text
many lines of some more text
//

Now, is there a way to tell Python to iteratively read the corresponding 3 chunks of that file into some variable var? More precisely:

Iteration 1: var =

LOCUS 123 some more text
many lines of some more text
many lines of some more text
//

Iteration 2: var =

LOCUS 231 some more text
many lines of some more text
many lines of some more text
//

Iteration 3: var =

LOCUS 312 some more text
many lines of some more text
many lines of some more text
//

Thanks in advance, and best wishes for the upcoming holidays!

1 answer:

Answer 0 (score: 0)

Suppose we have the following text file:

LOCUS 421 bla bla ba
Lorem ipsum dolor sit amet, 
consectetur adipiscing elit. 
Duis eu erat orci. Quisque 
nec augue ultricies, dignissim 
neque id, feugiat risus.
//
LOCUS 421 blabla
Nullam pulvinar quis ante
at condimentum.
//

We can do:

# Read the file entry by entry; a "with" block ensures the file is closed
with open("somefile.txt", "r") as pf:
    is_processing = True

    # Handles chunks
    while True:
        first_chunk_line = True
        chunk_lines = []

        # Handles one chunk
        while True:
            data_line = pf.readline()

            # Detect the end of the file
            if data_line == '':
                is_processing = False
                break

            # Detect the first line of a chunk
            if first_chunk_line:
                if "LOCUS" not in data_line:
                    raise Exception("Data file is malformed!")

                first_chunk_line = False
                continue  # skip the LOCUS header line itself

            # Detect the end of a locus / chunk
            if data_line.strip() == "//":
                break

            # If it is neither a first line, an end line, nor the end of the
            # file, it must be a chunk line holding precious DNA information
            chunk_lines.append(data_line)

        # Stop once the end of the file has been reached
        if not is_processing:
            break

        # Do something with the lines of one chunk
        print(chunk_lines)
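As a follow-up, the same chunking logic can also be written as a generator, which streams the file line by line (so only one entry is ever held in memory) and lets the caller process entries with a plain for loop. This is just a sketch under the question's stated assumptions (every entry is terminated by a // line); the name iter_chunks is made up for illustration:

```python
def iter_chunks(path):
    # Yield each database entry, from its LOCUS header line up to and
    # including the terminating "//" line, as a list of lines.
    # Assumption: every entry is closed by a "//" line; a trailing,
    # unterminated entry would be silently dropped.
    chunk = []
    with open(path, "r") as fh:
        for line in fh:  # the file is streamed, never fully loaded
            chunk.append(line)
            if line.strip() == "//":
                yield chunk
                chunk = []

# Usage:
# for var in iter_chunks("somefile.txt"):
#     print(var)
```

Unlike the loop above, this variant keeps the LOCUS and // lines inside each chunk, so the entry identifier on the LOCUS line is still available to the caller.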