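Optimizing reading a large file

Time: 2020-06-02 12:59:56

Tags: python generator

I need to scan a large file (2.2 GB) and retrieve, essentially, the first two words of every line.
This is the code I wrote: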

from itertools import islice


def generate_N_Lines(fileObj, N):
    """
    A generator that, given a file object and an integer N, yields lists of up to N lines at a time.
    """
    while True:
        # islice pulls at most N lines from the file iterator without loading the whole file.
        lines = list(islice(fileObj, N))
        if not lines:
            # End of file: stop here instead of yielding empty strings or None.
            return
        yield lines


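def process_pileup(lines, res):
    if not lines:
        return

    for line in lines:
        # Split at most twice: only the first two tab-separated fields are needed.
        data = line.split('\t', 2)
        if len(data) < 2:
            # Skip blank or malformed lines (e.g. empty strings read at end of file).
            continue
        res.append([data[0], data[1]])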


def scanPileup(pileup_path):
    res = []
    with open(pileup_path) as f:
        for lines in generate_N_Lines(f, 4):
            process_pileup(lines, res)
    print(res)


if __name__ == '__main__':
    scanPileup('path/to/file')  # placeholder path to the pileup file

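The problem is that this takes a very long time to run: the last time I tried, it ran for more than 5 hours before I gave up and stopped it. I can't think of a way to reduce the run time. How can I optimize this code?

File type: txt
File format: Pileup
Example line: seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&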
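
For reference, here is a minimal sketch of one direction that might help. It assumes the fields are tab-separated, as the split('\t') call in the code suggests. The helper name iter_first_two_fields and the sample data are hypothetical and not part of the original code. The idea is to iterate over the file object directly and yield each pair lazily, so nothing accumulates in one large list and nothing huge is printed at the end. Whether this removes the actual bottleneck has not been measured here.

```python
import io


def iter_first_two_fields(file_obj, sep="\t"):
    """Lazily yield the first two fields of every line.

    Hypothetical helper; the tab separator is an assumption about the file.
    """
    for line in file_obj:
        # Drop the trailing newline, then split at most twice:
        # everything after the second field stays in one unused chunk.
        fields = line.rstrip("\n").split(sep, 2)
        if len(fields) < 2:
            # Blank or malformed line: skip it.
            continue
        yield fields[0], fields[1]


# Small in-memory stand-in for the real file (made-up sample data).
sample = io.StringIO(
    "seq1\t272\tT\t24\t,.$\t<<<\n"
    "seq1\t273\tA\t23\t,.\t<<\n"
    "\n"
)
pairs = list(iter_first_two_fields(sample))
```

With a real file, the generator could be consumed inside `with open(pileup_path) as f:` and each pair written out or aggregated as it arrives, rather than collected first.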

0 answers:

No answers yet