I want to get the values between >>Sequence Length Distribution warn and >>END_MODULE.
...
>>Sequence Length Distribution warn
#Length Count Percent==ReadLength Percent >=ReadLength
32 6273 0.00103077 0.103077012 100
...
40 4043555 0.664431004 66.44310036 66.44310036 398560
>>END_MODULE
...
The following code prints only >>Sequence Length Distribution warn and none of the values.
with open("fastqc_data_example.txt", 'rU') as f:
    for line in f:
        start = False
        stop = False
        if line.startswith(">>Sequence Length Distribution"):
            start = True
        if start and stop == False:
            if line.startswith(">>END_MODULE"):
                stop = True
            print line
Here is the full file:
##FastQC 0.11.2
>>Basic Statistics pass
#Measure Value
Filename CLI_S8_L001_R2_001.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 6085741
Sequences flagged as poor quality 0
Sequence length 32-40
%GC 38
>>END_MODULE
>>Per base sequence quality pass
#Base Mean Median Lower Quartile Upper Quartile 10th Percentile 90th Percentile
1 31.33065653 32 32 32 32 32
2 31.29100285 32 32 32 32 32
21112 37 0.463116522
21112 38 0.316991647
21112 39 0.331988727
21112 40 0.189154113
>>END_MODULE
>>Per sequence quality scores pass
#Quality Count
2 4976
3 1471
33 314850
34 930733
35 3951958
>>END_MODULE
>>Per base sequence content fail
#Base G A T C
1 17.36647868 26.98028366 25.65663156 29.9966061
2 22.77713232 33.89914627 22.4154475 20.90827391
40 20.56104803 0 50.69674445 28.74220753
>>END_MODULE
>>Per sequence GC content pass
#GC Content Count
0 6418
1 7437.5
99 1436.5
100 2454
>>END_MODULE
>>Per base N content pass
#Base N-Count
1 0.117980703
2 0.078149892
40 0.196089827
>>END_MODULE
>>Sequence Length Distribution warn
#Length Count Percent==ReadLength Percent >=ReadLength
32 6273 0.00103077 0.103077012 100
33 337 5.53753E-05 0.005537534 99.89692299
40 4043555 0.664431004 66.44310036 66.44310036 398560
>>END_MODULE
>>Sequence Duplication Levels warn
#Total Deduplicated Percentage 67.07258691
#Duplication Level Percentage of deduplicated Percentage of total
1 70.12224901 47.03280641
>5k 0 0
>10k+ 0 0
>>END_MODULE
>>Overrepresented sequences pass
>>END_MODULE
>>Adapter Content pass
#Position Illumina Universal Adapter Illumina Small RNA Adapter Nextera Transposase Sequence
1 0 0 0
2 0 0 1.64E-05
27 8.22E-05 1.31E-04 1.81E-04
28 8.22E-05 1.31E-04 1.81E-04
>>END_MODULE
>>Kmer Content fail
#Sequence Count PValue Obs/Exp Max Max Obs/Exp Position
TTTAAGT 10275 0 12.408118 34
CATAAAG 6720 0 9.565453 1
>>END_MODULE
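Answer 0: (score: 2)
You reset the values of start and stop on every iteration of the loop, so start is always False again by the time the next line is read. You need to initialise them outside the for loop: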
start = False
stop = False
for line in f:
    ....
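Applied to the question's code, that fix could look like the following minimal Python 3 sketch. The function name `lines_between`, its parameters, and the `sample` data are illustrative and not part of the question. Unlike the original loop, this version does not emit the >>END_MODULE line itself.

```python
def lines_between(lines, start_prefix=">>Sequence Length Distribution",
                  stop_prefix=">>END_MODULE"):
    """Collect lines from the start marker up to, but not including, the stop marker."""
    start = False  # initialised once, outside the loop
    stop = False
    collected = []
    for line in lines:
        if line.startswith(start_prefix):
            start = True
        if start and not stop:
            if line.startswith(stop_prefix):
                stop = True
            else:
                collected.append(line.rstrip("\n"))
    return collected


# Hypothetical miniature of the FastQC file, for illustration only.
sample = [
    ">>Per base N content pass\n",
    "1 0.117980703\n",
    ">>END_MODULE\n",
    ">>Sequence Length Distribution warn\n",
    "#Length Count\n",
    "32 6273\n",
    ">>END_MODULE\n",
    ">>Kmer Content fail\n",
]

for line in lines_between(sample):
    print(line)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `sample`.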
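Answer 1: (score: 1)
Instead of using two Boolean variables, start and stop, use a single one that tells you whether you are currently between the two lines.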
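A single-flag version might be sketched as follows in Python 3. The generator name `module_rows`, the flag name `inside`, and the `sample` data are assumptions made for illustration. This variant also skips the module header and the `#` column-header line, so only the data rows are yielded.

```python
def module_rows(lines, module_name="Sequence Length Distribution"):
    """Yield the data rows of one module, using a single 'inside' flag."""
    inside = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>" + module_name):
            inside = True          # entered the wanted module
        elif inside:
            if line.startswith(">>END_MODULE"):
                return             # left the module, nothing more to read
            if not line.startswith("#"):
                yield line         # a data row


# Hypothetical miniature of the FastQC file, for illustration only.
sample = [
    ">>Per base N content pass\n",
    "1 0.117980703\n",
    ">>END_MODULE\n",
    ">>Sequence Length Distribution warn\n",
    "#Length Count\n",
    "32 6273\n",
    "33 337\n",
    ">>END_MODULE\n",
    ">>Kmer Content fail\n",
    "TTTAAGT 10275\n",
]

for row in module_rows(sample):
    print(row.split())
```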
Answer 2: (score: 1)
Try something like this:
f = open('test', 'r')
start = False
for line in f:
    if line.startswith(">>Sequence Length Distribution"):
        start = True
    if line.startswith(">>END_MODULE"):
        start = False
    if start:
        print line
input()
Answer 3: (score: 1)
If you don't care about the >> prefix or the END_MODULE line, then the following would be a fairly clean solution:
with open("test", 'rU') as f:
    # Split file contents into blocks
    blocks = f.read().split(r'>>')
    # Print the first block with the desired name
    print [block for block in blocks
           if block.startswith('Sequence Length Distribution')][0]
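The same split-based idea, as a rough Python 3 sketch: the helper name `find_block` and the inline `text` are assumptions for illustration. It uses `next()` with a default so that a missing module yields `None` rather than raising `IndexError` the way `[...][0]` would. Note that any `>>` appearing inside a data row would also split the text.

```python
def find_block(text, name="Sequence Length Distribution"):
    """Return the first '>>'-delimited block whose text starts with name, or None."""
    blocks = text.split(">>")
    return next((block for block in blocks if block.startswith(name)), None)


# Hypothetical miniature of the FastQC file, for illustration only.
text = (
    ">>Per base N content pass\n"
    "#Base N-Count\n"
    "1 0.117980703\n"
    ">>END_MODULE\n"
    ">>Sequence Length Distribution warn\n"
    "#Length Count\n"
    "32 6273\n"
    "33 337\n"
    ">>END_MODULE\n"
)

block = find_block(text)
if block is not None:
    print(block)
```

With a real file, `text` would come from `f.read()` inside a `with open(...)` block.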