How to read a specific chunk from a huge file

Time: 2017-05-24 10:34:47

Tags: python

I have a huge file of about 2 GB, with data like this:

>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2]
GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT
TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT
TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
>TRINITY_DN35855_c0_g1_i1 len=782 path=[760:0-781] [-1, 760, -2]
CAGGTTTAACTTTAACACCTCCGACCCTGCCTCTAAATTCCTGCACAGAAATTTGGCTTC
ACAATTAGGACATGTTTGGATAAACAGTTTAATGAAGCACTTTTTTTCATAAATTCTGGT
ATCTGGCTATAAGACCTAATAATCTGGGGATCTGTTTCATCATCCACGAAGGGAGCCCAA
>TRINITY_DN67801_c0_g1_i1 len=420 path=[398:0-419] [-1, 398, -2]
GTACAGAAGGAGATGAACCAGAACTTTGCCTATCTCTACAATCATCTCCTTATCCCTCCT
TATGACCCAGAGAATCCGGCTGCTCCTATTCCTCCCGTTGTGTCACTACAAATTATGCCT
>TRINITY_DN52435_c0_g1_i1 len=209 path=[187:0-208] [-1, 187, -2]
TGGTCAAACTTGTATGAGTTCTAAACTCCTTGGGTTTTCTGCTAAGCGAAAGCCGCTTGT
ACTTTAGCTTCTGTTTAGTTAGATAGCACCACCTCATAAGCGCAGTTCTGTTTTGAGGTT

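I want to write code that returns a single chunk: it starts at line 5 and ends just before the next line that begins with the character ">". The result should look like the example below. I want to extract many chunks like this: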

    >TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
    ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
    TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
    GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
    TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA

What is the best way to do this? Thanks in advance.

4 answers:

Answer 0: (score: 1)

If you know which line your data starts on, you can use this function:

def extract_chunk(start_line):
    """
    start_line is the line number where your data starts, counting from 0
    """
    lines = []
    with open("data.txt") as f:
        for i, line in enumerate(f):
            if i < start_line:
                continue  # skip everything before the requested chunk
            if i > start_line and line.startswith(">"):
                break  # the next header line ends the chunk
            lines.append(line)
    return "".join(lines)
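
The function above re-reads the file from the top on every call, which adds up on a 2 GB file when many chunks are needed. As a rough sketch of one way around that, the version below collects several chunks in a single pass. The name `extract_chunks` and its `path` and `start_lines` parameters are hypothetical, and it assumes every requested start line is a header line beginning with ">":

```python
def extract_chunks(path, start_lines):
    """Return {start_line: chunk_text} for each 0-based header line requested.

    Assumption: every number in start_lines points at a line starting with ">".
    """
    wanted = set(start_lines)
    chunks = {}
    current = None  # start line of the chunk being collected, if any
    with open(path) as f:
        for i, line in enumerate(f):
            if line.startswith(">"):
                # A header closes any open chunk and may open a new one.
                current = i if i in wanted else None
                if current is not None:
                    chunks[current] = [line]
            elif current is not None:
                chunks[current].append(line)
    return {start: "".join(lines) for start, lines in chunks.items()}
```

For the sample data in the question, `extract_chunks("data.txt", [4])` would return a dictionary whose single value is the TRINITY_DN63782 record.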

Answer 1: (score: 1)

It is not clear whether you want the chunk to end when a '>' appears at the beginning of a line or anywhere in a line, so I will assume the first scenario:

chunk = []
with open("your_large_file.ext", "r") as f:
    for _ in range(4):  # skip the first 4 lines (range() works on Python 2 and 3)
        next(f, None)  # the default avoids StopIteration on a very short file
    for line in f:
        if chunk and line.startswith(">"):  # break on > if we're already collecting a chunk
            break
        chunk.append(line)
print("".join(chunk))  # or whatever you want to do with it

>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA

Answer 2: (score: 0)

Here is another possible solution:

def get_chuck():
    # file1.txt in my case where I have mocked your data
    # note: this loads the whole file into memory at once
    with open("file1.txt") as f:
        full_str = f.read()

    # split on ">" and put the marker back at the start of each record
    full_str = [">" + x for x in full_str.split(">")[1:]]
    print(full_str[0])
    # use full_str for your need

get_chuck()

Output

    >TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2]
    GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT
    TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT
    TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA
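
Since the approach above holds the whole 2 GB file in memory, a streaming variant may be worth considering. The following is only a sketch: `iter_records` and `pick_records` are hypothetical names, and it assumes a record ID is the first whitespace-separated token after ">" (for example `TRINITY_DN63782_c0_g1_i1`):

```python
def iter_records(path):
    """Yield (header, body_text) pairs one record at a time."""
    header, body = None, []
    with open(path) as f:
        for line in f:
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(body)
                header, body = line.rstrip("\n"), []
            elif header is not None:
                body.append(line)
    if header is not None:
        yield header, "".join(body)  # last record has no following header


def pick_records(path, wanted_ids):
    """Return {record_id: record_text} for the requested IDs (assumed format)."""
    wanted = set(wanted_ids)
    found = {}
    for header, body in iter_records(path):
        parts = header[1:].split()
        record_id = parts[0] if parts else ""
        if record_id in wanted:
            found[record_id] = header + "\n" + body
            if len(found) == len(wanted):
                break  # stop reading once everything has been found
    return found
```

Only one record is held in memory at a time, and the scan stops early once every requested ID has been seen.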

Answer 3: (score: 0)

start_ln = 4
chunk = []
with open("data.txt", buffering=2**12) as f:  # explicit buffer size; may help read speed
    for i, ln in enumerate(f):
        if i < start_ln:
            continue  # not at the requested chunk yet
        if i > start_ln and ln.startswith(">"):
            break  # the next header line ends the chunk
        chunk.append(ln)
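
When many chunks have to be pulled from the same file, scanning from line 0 each time gets expensive. One possible alternative, sketched here under assumptions, is to record the byte offset of every header once and then jump straight to a record with `seek`. The names `build_index` and `read_record` are hypothetical, the record ID is assumed to be the first token after ">", and the file is assumed to be ASCII:

```python
def build_index(path):
    """Map each record ID to the byte offset of its header line."""
    index = {}
    offset = 0
    with open(path, "rb") as f:  # binary mode keeps byte offsets exact
        for line in f:
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    index[parts[0].decode("ascii")] = offset
            offset += len(line)
    return index


def read_record(path, offset):
    """Read one record starting at the given byte offset."""
    chunk = []
    with open(path, "rb") as f:
        f.seek(offset)
        for line in f:
            if chunk and line.startswith(b">"):
                break  # the next header ends this record
            chunk.append(line)
    return b"".join(chunk).decode("ascii")
```

Building the index still costs one full pass, but each later lookup such as `read_record(path, index["TRINITY_DN63782_c0_g1_i1"])` reads only the lines of that record.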