Parsing only part of a file

Time: 2016-02-24 03:05:00

Tags: python parsing

I want to get the values between the `>>Sequence Length Distribution warn` line and the following `>>END_MODULE` line.

    ...
    >>Sequence Length Distribution  warn
    #Length Count       Percent==ReadLength Percent >=ReadLength
    32  6273    0.00103077  0.103077012 100
    ...
    40  4043555 0.664431004 66.44310036 66.44310036     398560
    >>END_MODULE
    ...

The following code prints only the `>>Sequence Length Distribution warn` line and none of the values.

    with open("fastqc_data_example.txt", 'rU') as f:

        for line in f:
            start = False
            stop = False
            if line.startswith(">>Sequence Length Distribution"):
                start = True

            if start and stop == False:
                if line.startswith(">>END_MODULE"):
                    stop = True
                print line

Here is the full file:

    ##FastQC    0.11.2                  
    >>Basic Statistics  pass                    
    #Measure    Value                   
    Filename    CLI_S8_L001_R2_001.fastq.gz                 
    File type   Conventional base calls                 
    Encoding    Sanger / Illumina 1.9                   
    Total Sequences 6085741                 
    Sequences flagged as poor quality   0                   
    Sequence length 32-40                   
    %GC 38                  
    >>END_MODULE                        
    >>Per base sequence quality pass                    
    #Base   Mean    Median  Lower Quartile  Upper Quartile  10th Percentile 90th Percentile
    1   31.33065653 32  32  32  32  32
    2   31.29100285 32  32  32  32  32
    21112   37  0.463116522             
    21112   38  0.316991647             
    21112   39  0.331988727             
    21112   40  0.189154113             
    >>END_MODULE                        
    >>Per sequence quality scores   pass                    
    #Quality    Count                   
    2   4976                    
    3   1471
    33  314850                  
    34  930733                  
    35  3951958                 
    >>END_MODULE                        
    >>Per base sequence content fail                    
    #Base   G   A   T   C       
    1   17.36647868 26.98028366 25.65663156 29.9966061      
    2   22.77713232 33.89914627 22.4154475  20.90827391
    40  20.56104803 0   50.69674445 28.74220753     
    >>END_MODULE                        
    >>Per sequence GC content   pass                    
    #GC Content Count                   
    0   6418                    
    1   7437.5
    99  1436.5                  
    100 2454                    
    >>END_MODULE                        
    >>Per base N content    pass                    
    #Base   N-Count                 
    1   0.117980703                 
    2   0.078149892
    40  0.196089827                 
    >>END_MODULE                        
    >>Sequence Length Distribution  warn                    
    #Length Count       Percent==ReadLength Percent >=ReadLength        
    32  6273    0.00103077  0.103077012 100     
    33  337 5.53753E-05 0.005537534 99.89692299     
    40  4043555 0.664431004 66.44310036 66.44310036     398560
    >>END_MODULE                        
    >>Sequence Duplication Levels   warn                    
    #Total Deduplicated Percentage  67.07258691                 
    #Duplication Level  Percentage of deduplicated  Percentage of total             
    1   70.12224901 47.03280641
    >5k 0   0               
    >10k+   0   0               
    >>END_MODULE                        
    >>Overrepresented sequences pass                    
    >>END_MODULE                        
    >>Adapter Content   pass                    
    #Position   Illumina Universal Adapter  Illumina Small RNA Adapter  Nextera Transposase Sequence            
    1   0   0   0           
    2   0   0   1.64E-05
    27  8.22E-05    1.31E-04    1.81E-04            
    28  8.22E-05    1.31E-04    1.81E-04            
    >>END_MODULE                        
    >>Kmer Content  fail                    
    #Sequence   Count   PValue  Obs/Exp Max Max Obs/Exp Position        
    TTTAAGT 10275   0   12.408118   34
    CATAAAG 6720    0   9.565453    1       
    >>END_MODULE                        

4 answers:

Answer 0: (score: 2)

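You reset the values of `start` and `stop` on every loop iteration, so `start` is only ever true for the header line itself. You need to declare them outside the `for` loop: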

    start = False
    stop = False
    for line in f:
        ....
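
Applied to the loop from the question, the fix might look like the sketch below. It is written for Python 3 and reads from `SAMPLE`, a shortened, hypothetical stand-in for the file (fields assumed to be tab-separated), so that it runs as is. The helper name `section_lines` is also an illustration; with a real file, pass the open file object instead.

```python
# Shortened stand-in for fastqc_data_example.txt (assumption: tab-separated fields)
SAMPLE = """\
>>Per base N content\tpass
#Base\tN-Count
1\t0.117980703
>>END_MODULE
>>Sequence Length Distribution\twarn
#Length\tCount
32\t6273
40\t4043555
>>END_MODULE
>>Kmer Content\tfail
>>END_MODULE
"""


def section_lines(lines):
    """Collect lines from the section header through its >>END_MODULE marker."""
    start = False  # initialised once, outside the loop
    stop = False
    collected = []
    for line in lines:
        if line.startswith(">>Sequence Length Distribution"):
            start = True
        if start and not stop:
            if line.startswith(">>END_MODULE"):
                stop = True  # the end marker itself is still collected below
            collected.append(line.rstrip("\n"))
    return collected


for line in section_lines(SAMPLE.splitlines(True)):
    print(line)
```

Because the flags now survive between iterations, every line after the header is kept until the first `>>END_MODULE` that follows it; earlier and later modules are ignored.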

Answer 1: (score: 1)

Instead of using two boolean variables, `start` and `stop`, use a single one that tells you when you are between the two lines.
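
One way the single-flag idea could be written is sketched below, for Python 3. The names `values_between` and `inside` and the sample lines are illustrative and not taken from the answer; the sample assumes tab-separated fields.

```python
def values_between(lines, start_marker, end_marker=">>END_MODULE"):
    """Return the lines strictly between start_marker and the next end_marker."""
    inside = False  # single flag: True only while we are between the two markers
    values = []
    for line in lines:
        if inside and line.startswith(end_marker):
            break  # section finished; nothing further is needed
        if inside:
            values.append(line.rstrip("\n"))
        elif line.startswith(start_marker):
            inside = True  # start collecting from the next line
    return values


# Hypothetical sample lines standing in for the real file
sample = [
    ">>Per base N content\tpass\n",
    "1\t0.117980703\n",
    ">>END_MODULE\n",
    ">>Sequence Length Distribution\twarn\n",
    "#Length\tCount\n",
    "32\t6273\n",
    ">>END_MODULE\n",
    ">>Kmer Content\tfail\n",
    ">>END_MODULE\n",
]
print(values_between(sample, ">>Sequence Length Distribution"))
```

Checking the end marker before appending, and the start marker after, keeps both marker lines out of the result. An open file object can be passed in place of `sample`.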


Answer 2: (score: 1)

Try something like this:

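    # 'test' is the placeholder file name used in this answer
    with open('test', 'r') as f:
        start = False
        for line in f:
            # Entering the section: start printing from the header line
            if line.startswith(">>Sequence Length Distribution"):
                start = True
            # Leaving the section: stop before printing the end marker
            if line.startswith(">>END_MODULE"):
                start = False
            if start:
                print(line)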

Answer 3: (score: 1)

If you don't care about the `>>` prefix or the `>>END_MODULE` line, the following is a fairly clean solution:

    with open("test", 'r') as f:
        # Split file contents into blocks
        blocks = f.read().split('>>')
        # Print the first block with the desired name
        print([block for block in blocks
               if block.startswith('Sequence Length Distribution')][0])
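
To go from the block of text to the individual values, each line of the block can be split into fields. The following is a minimal sketch for Python 3, assuming the fields are tab-separated and that `>>` appears only in module markers; `parse_block` and the sample text are hypothetical names used for illustration.

```python
def parse_block(text, name):
    """Split text on '>>' and return the data rows of the block called name."""
    blocks = text.split(">>")
    # next() raises StopIteration if no block starts with the given name
    block = next(b for b in blocks if b.startswith(name))
    rows = []
    for line in block.splitlines()[1:]:  # skip the "name<TAB>status" line
        if line.strip() and not line.startswith("#"):  # skip blanks and column headers
            rows.append(line.rstrip().split("\t"))
    return rows


# Hypothetical, shortened file contents
text = (
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "%GC\t38\n"
    ">>END_MODULE\n"
    ">>Sequence Length Distribution\twarn\n"
    "#Length\tCount\n"
    "32\t6273\n"
    "40\t4043555\n"
    ">>END_MODULE\n"
)
print(parse_block(text, "Sequence Length Distribution"))
```

Like the snippet above, this reads the whole file into memory, which is fine for small report files but less suitable for very large inputs than a line-by-line loop.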