I have a huge file of about 2 GB with data like the following:
>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2]
GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT
TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT
TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
>TRINITY_DN35855_c0_g1_i1 len=782 path=[760:0-781] [-1, 760, -2]
CAGGTTTAACTTTAACACCTCCGACCCTGCCTCTAAATTCCTGCACAGAAATTTGGCTTC
ACAATTAGGACATGTTTGGATAAACAGTTTAATGAAGCACTTTTTTTCATAAATTCTGGT
ATCTGGCTATAAGACCTAATAATCTGGGGATCTGTTTCATCATCCACGAAGGGAGCCCAA
>TRINITY_DN67801_c0_g1_i1 len=420 path=[398:0-419] [-1, 398, -2]
GTACAGAAGGAGATGAACCAGAACTTTGCCTATCTCTACAATCATCTCCTTATCCCTCCT
TATGACCCAGAGAATCCGGCTGCTCCTATTCCTCCCGTTGTGTCACTACAAATTATGCCT
>TRINITY_DN52435_c0_g1_i1 len=209 path=[187:0-208] [-1, 187, -2]
TGGTCAAACTTGTATGAGTTCTAAACTCCTTGGGTTTTCTGCTAAGCGAAAGCCGCTTGT
ACTTTAGCTTCTGTTTAGTTAGATAGCACCACCTCATAAGCGCAGTTCTGTTTTGAGGTT
I want to write code that returns one chunk: it starts at line 5 and ends at the next line that begins with the character ">". The output would look like this, and I want to extract many chunks of this form:
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
What is the best way to do this? Thanks in advance.
Answer 0 (score: 1)
If you know which line your data starts on, you can use this function:
def extract_chunk(start_line):
    """
    start_line is the line number where your data starts, counting from 0
    """
    lines = []
    with open("data.txt") as f:
        for i, line in enumerate(f):
            if i < start_line:
                continue  # skip everything before the chunk
            if i > start_line and line.startswith(">"):
                break     # stop at the next record header
            lines.append(line)
    return "".join(lines)
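A quick way to sanity-check this logic without the 2 GB file is to run it on a small in-memory sample. The variant below is hypothetical (the name `extract_chunk_from` and the sample data are mine, not the answer's); it applies the same start-line/header rule to any iterable of lines:

```python
def extract_chunk_from(lines, start_line):
    """Same idea as extract_chunk, but takes any iterable of lines
    instead of a hard-coded filename, which makes it easy to test."""
    out = []
    for i, line in enumerate(lines):
        if i < start_line:
            continue  # skip everything before the chunk
        if i > start_line and line.startswith(">"):
            break     # stop at the next record header
        out.append(line)
    return "".join(out)

# Tiny stand-in for the real file; the record of interest starts at index 2.
sample = [">h0\n", "AAA\n", ">h1\n", "BBB\n", "CCC\n", ">h2\n"]
chunk = extract_chunk_from(sample, 2)
```

On the real file you would pass the open file object instead: `with open("data.txt") as f: chunk = extract_chunk_from(f, 4)`.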
Answer 1 (score: 1)
It is not clear whether you want your chunk to end when a ">" appears at the beginning of a line or anywhere within a line, so I will assume the first scenario:
chunk = []
with open("your_large_file.ext", "r") as f:
    for _ in xrange(4):  # skip 4 lines, use range() on Python 3.x instead
        next(f)
    for line in f:
        if chunk and line.startswith(">"):  # break on > if we're already collecting a chunk
            break
        chunk.append(line)
print("".join(chunk))  # or whatever you want to do with it
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
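The manual `next(f)` loop can also be written with `itertools.islice`, which skips the leading lines lazily. A minimal sketch of that variation (shown on an in-memory list of lines rather than the real file, so the sample data here is made up):

```python
import itertools

# Stand-in for the open file object; any iterable of lines works.
lines = [">h0\n", "AAA\n", "BBB\n", "CCC\n", ">h1\n", "DDD\n", "EEE\n", ">h2\n"]

chunk = []
for line in itertools.islice(iter(lines), 4, None):  # skip the first 4 lines
    if chunk and line.startswith(">"):  # stop at the next record header
        break
    chunk.append(line)
result = "".join(chunk)
```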
Answer 2 (score: 0)
This could be another solution:
def get_chunk():
    full_str = ""
    # file1.txt in my case, where I have mocked your data
    with open("file1.txt") as f:
        for line in f:
            full_str += line
    full_str = [">" + x for x in full_str.split(">")[1:]]
    print(full_str[0])
    # use full_str for your needs

get_chunk()
Output:
>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2]
GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT
TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT
TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA
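Reading a ~2 GB file into one string, as above, can be hard on memory. A streaming alternative (my sketch, not part of the original answer) is a generator that walks the file line by line and yields one ">"-delimited record at a time:

```python
def iter_chunks(lines):
    """Yield each '>'-headed record from an iterable of lines."""
    chunk = []
    for line in lines:
        if line.startswith(">") and chunk:
            yield "".join(chunk)  # a new header closes the previous record
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)      # don't lose the final record

# Demo on a small in-memory sample; on the real file you would iterate
# over iter_chunks(open("file1.txt")) instead.
sample = [">A x\n", "GTCC\n", ">B y\n", "ATAG\n", "TTGA\n"]
records = list(iter_chunks(sample))
```

Because nothing beyond the current record is held in memory, this scales to files much larger than RAM.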
Answer 3 (score: 0)
start_ln = 4
chunk = []
with open("data.txt", buffering=2**12) as f:  # a larger buffer helps processing speed
    for i, ln in enumerate(f):
        if i < start_ln:
            continue
        if i > start_ln and ln.startswith(">"):
            break
        chunk.append(ln)