Parsing lines with numbers and words

Time: 2014-04-08 15:12:37

Tags: python regex parsing

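I have a file to parse and I don't know what the best strategy is for building the regular expression. I want to get the lines where the data is. (I already extract the values I want from each line, but I realized I was missing some matches because my first regex wasn't good.)

Here are some of the regexes/strategies I have tried:

  1. Find the header and match everything that follows, up to the two empty lines: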

    data_regex = re.compile("(?<=    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n)[^(\n)^(\n)^]+")
    

    What it matches:

    1.3e-26   92.9  13.7    4.3e-26   91.2   8.9    2.0  2  BPD_transp_1 Binding-protein-dependent transport system inne
    4.7e-34  117.1  19.5      9e-34  116.2  13.5    1.4  1  BPD_transp_1 Binding-protein-dependent transport system inne
    3.2e-153  509.4   5.2   3.6e-153  509.2   3.6    1.0  1  IMPDH        IMP dehydrogenase / GMP reductase domain
    1.3e-20   73.2   0.2    3.4e-19   68.6   0.1    2.5  3  DEAD         DEAD/DEAH box helicase
    6.9e-11   42.1   0.0    1.8e-09   37.5   0.0    2.4  2  CTP_transf_2 Cytidylyltransferase
    

    As you can see, it matches some of the data, but not all of it as I expected. So I tried another one:

    data_regex = re.compile("(?<=    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\s)(.+\s)+")
    

    With this expression I expected to match more than I needed, including the --- line, but I ended up with this:

    3.6    7.2  11.6       0.13   11.9   3.6    2.0  2  Spore_YabQ   Spore cortex protein YabQ (Spore_YabQ)
    
    0.63    9.6   3.1       0.42   10.2   0.3    2.1  2  IBV_3C       IBV 3C protein
    
    0.38    9.6   4.8       0.65    8.9   0.8    2.6  3  PcrB         PcrB family
    
    0.059   12.6   0.3          1    8.6   0.0    2.8  3  DUF699       Putative ATPase (DUF699)
    
    0.028   14.1   0.9         14    5.7   0.0    3.8  4  HEAT         HEAT repeat
    

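    Again, some results, but not what I expected.

  2. Find the structure of numbers separated by whitespace, repeated several times, followed by words: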

    data_regex = re.compile("(\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+)(\w+\s)+")
    

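    But it finds many single numbers followed by a word elsewhere in the file, not the runs of space-separated numbers followed by words that I want: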

    (' 2010 ', '2010', 'Medical ')
    (' 1 ', '1', 'domain ')
    ('    1.5  ', '1.5', '1 ')
    ('   6.2e-27      ', '6.2e-27', '12 ')
    ('      17     ', '17', '129 ')
    ('       7     ', '7', '130 ')
    (' 0.92\n\n  ', '0.92', 'each ')
    (' 5.2e-31\n                        ', '5.2e-31', 'PucR ')
    
  3. I use this to get the matches:

    data_result = re.findall(data_regex, document)
    print data_result
    

    The kind of data I am parsing, an excerpt of the file:

    # CPU time: 0.66u 0.50s 00:00:01.16 Elapsed: 00:00:00.55
    # Mc/sec: 902.81
    //
    Query:       LD_216  [L=247]
    Description: # 237337 # 238077 # 1 # ID=1_216;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.390
    Scores for complete sequence (score includes all domains):
       --- full sequence ---   --- best 1 domain ---    -#dom-
        E-value  score  bias    E-value  score  bias    exp  N  Model        Description
        ------- ------ -----    ------- ------ -----   ---- --  --------     -----------
          3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter
        3.2e-11   42.5   0.3    9.7e-11   40.9   0.2    1.7  1  SMC_N        RecF/RecN/SMC N terminal domain
        3.1e-05   22.4   0.1       0.17   10.1   0.0    2.6  2  ABC_ATPase   Predicted ATPase of the ABC class
        6.5e-05   21.8   0.1     0.0001   21.2   0.0    1.3  1  DUF258       Protein of unknown function, DUF258
          0.001   19.0   0.5       0.21   11.5   0.0    2.2  2  AAA          ATPase family associated with various cellular 
         0.0019   16.4   0.1     0.0046   15.1   0.0    1.6  2  DLIC         Dynein light intermediate chain (DLIC)
         0.0032   15.8   0.1      0.028   12.7   0.0    2.0  2  Adeno_IVa2   Adenovirus IVa2 protein
      ------ inclusion threshold ------
          0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase
          0.018   14.3   0.0      0.046   13.0   0.0    1.6  1  UPF0079      Uncharacterised P-loop hydrolase UPF0079
           0.02   13.3   0.2      0.041   12.3   0.1    1.4  1  Rad17        Rad17 cell cycle checkpoint protein
          0.026   13.7   0.1      0.049   12.8   0.0    1.4  1  PduV-EutP    Ethanolamine utilisation - propanediol utilisat
          0.046   12.2   0.0      0.085   11.4   0.0    1.5  1  GSPII_E      Type II/IV secretion system protein
           0.05   12.4   0.0      0.087   11.6   0.0    1.4  1  Mg_chelatase Magnesium chelatase, subunit ChlI
          0.054   12.0   0.2      0.094   11.2   0.2    1.7  1  NB-ARC       NB-ARC domain
          0.056   12.9   0.1       0.15   11.5   0.1    1.8  1  MobB         Molybdopterin guanine dinucleotide synthesis pr
          0.059   12.0   0.4        8.9    4.8   0.0    2.4  2  KAP_NTPase   KAP family P-loop domain
          0.079   12.3   0.3       0.57    9.5   0.1    2.1  2  AAA_5        AAA domain (dynein-related subfamily)
          0.086   11.9   0.2       0.32   10.0   0.0    2.0  2  IstB         IstB-like ATP binding protein
           0.13   11.0   1.6        3.5    6.3   0.1    2.7  3  KaiC         KaiC
           0.23   11.3   1.3       0.92    9.4   0.1    2.7  4  RNA_helicase RNA helicase
    
    
    Domain annotation for each model (and alignments):
    >> ABC_tran  ABC transporter
    
    
    # Here begins other type of data but above there are two empty lines
    

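    The ------ inclusion threshold ------ line can come right after the ------- ------ ----- ------- ------ ----- ---- -- -------- ----------- line or at any position among the data rows. If possible, I would like to know where it falls relative to each matched line, because rows below the inclusion threshold will need to be handled differently.

    How can I get these lines of the file?

    Expected output: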

          3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter
        3.2e-11   42.5   0.3    9.7e-11   40.9   0.2    1.7  1  SMC_N        RecF/RecN/SMC N terminal domain
        3.1e-05   22.4   0.1       0.17   10.1   0.0    2.6  2  ABC_ATPase   Predicted ATPase of the ABC class
        6.5e-05   21.8   0.1     0.0001   21.2   0.0    1.3  1  DUF258       Protein of unknown function, DUF258
          0.001   19.0   0.5       0.21   11.5   0.0    2.2  2  AAA          ATPase family associated with various cellular 
         0.0019   16.4   0.1     0.0046   15.1   0.0    1.6  2  DLIC         Dynein light intermediate chain (DLIC)
         0.0032   15.8   0.1      0.028   12.7   0.0    2.0  2  Adeno_IVa2   Adenovirus IVa2 protein
    
          0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase
          0.018   14.3   0.0      0.046   13.0   0.0    1.6  1  UPF0079      Uncharacterised P-loop hydrolase UPF0079
           0.02   13.3   0.2      0.041   12.3   0.1    1.4  1  Rad17        Rad17 cell cycle checkpoint protein
          0.026   13.7   0.1      0.049   12.8   0.0    1.4  1  PduV-EutP    Ethanolamine utilisation - propanediol utilisat
          0.046   12.2   0.0      0.085   11.4   0.0    1.5  1  GSPII_E      Type II/IV secretion system protein
           0.05   12.4   0.0      0.087   11.6   0.0    1.4  1  Mg_chelatase Magnesium chelatase, subunit ChlI
          0.054   12.0   0.2      0.094   11.2   0.2    1.7  1  NB-ARC       NB-ARC domain
          0.056   12.9   0.1       0.15   11.5   0.1    1.8  1  MobB         Molybdopterin guanine dinucleotide synthesis pr
          0.059   12.0   0.4        8.9    4.8   0.0    2.4  2  KAP_NTPase   KAP family P-loop domain
          0.079   12.3   0.3       0.57    9.5   0.1    2.1  2  AAA_5        AAA domain (dynein-related subfamily)
          0.086   11.9   0.2       0.32   10.0   0.0    2.0  2  IstB         IstB-like ATP binding protein
           0.13   11.0   1.6        3.5    6.3   0.1    2.7  3  KaiC         KaiC
           0.23   11.3   1.3       0.92    9.4   0.1    2.7  4  RNA_helicase RNA helicase
    

    Edit: I finally switched to reading the file with readlines() and then doing the following for each line:

    elif lines.startswith("   "):
        data_regex = re.compile("-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")#Matches numbers
        data_result = re.findall(data_regex, lines) 
        data_regex2 = re.compile("[?!]") # Some other characters found
        data_result2 = re.findall(data_regex2, lines)
        data_regex3 = re.compile("-{2,}") # Finds where are the ----- lines
        data_result3 = re.findall(data_regex3, lines)
    
    # There are numbers in the line and there are 10 or more words and numbers (8 numbers
    # and plus id and description), and it doesn't have any "strange" character or it is
    # a --- line
        if data_result != [] and len(lines.split()) >= 10 and data_result2 == [] and data_result3 == []:
            print lines[:-1]
    

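2 answers:

Answer 0 (score: 0)

My suggestion:

  1. Remove all the comment lines, such as -----blablabla-----, so that you have a file containing only the data columns.
  2. If you use numpy, assume the columns are tab-separated: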

    #!/usr/bin/env python
    
    import numpy as np
    
    dat = np.genfromtxt('data.txt', delimiter='\t', dtype=str)
    

    dat will contain all the numbers and words in a 2D array of type str, and dat[:, 0:8] will then contain the eight numeric columns.
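    The excerpt in the question appears to be aligned with runs of spaces rather than tabs, so a plain whitespace split may be easier to apply. The following is only a minimal sketch, assuming every data row holds eight numeric fields followed by a model name and a free-text description; `split_row` is a hypothetical helper name, not part of numpy or of the original code.

```python
def split_row(line):
    """Return (values, model, description) for a data row, or None otherwise."""
    # maxsplit=9 yields at most 10 fields, so the description keeps its spaces
    fields = line.split(None, 9)
    if len(fields) < 9:
        return None
    try:
        # The first eight fields must all parse as numbers
        values = [float(field) for field in fields[:8]]
    except ValueError:
        return None
    model = fields[8]
    description = fields[9].strip() if len(fields) > 9 else ""
    return values, model, description


row = "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter"
print(split_row(row))
```

    Header, separator, and threshold lines come back as None because they have too few fields or contain fields that are not numbers, so they do not have to be removed from the file beforehand.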

Answer 1 (score: 0)

After reading the file line by line, I ended up with this regular expression:

    data_regex = re.compile("^ {3,10}((-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*){8}.+")

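It checks for enough spaces ({3,10}) at the start of the line (^) to avoid other data, followed by 8 ({8}) numbers (-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?) that may have whitespace between them (\s*), and then the rest of the line (.+).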
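To also record whether a row sits above or below the ------ inclusion threshold ------ line, the per-line check can be combined with a simple flag. This is a rough sketch rather than a tested solution: `parse_hits` is a hypothetical name, the pattern is a slightly stricter variant of the one above (it requires whitespace after each number via \s+), and resetting the flag on each "Query:" line is an assumption about how the file is laid out.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
# 3 to 10 leading spaces, eight whitespace-separated numbers, then the rest
ROW = re.compile(r"^ {3,10}(?:" + NUMBER + r"\s+){8}\S.*")


def parse_hits(lines):
    """Yield (row, below_threshold) for every data row found in lines."""
    below_threshold = False
    for line in lines:
        if line.startswith("Query:"):
            # Assumption: each query block starts above the threshold again
            below_threshold = False
        elif "inclusion threshold" in line:
            below_threshold = True
        elif ROW.match(line):
            yield line.rstrip("\n"), below_threshold


sample = [
    "Query:       LD_216  [L=247]\n",
    "    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\n",
    "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n",
    "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n",
    "  ------ inclusion threshold ------\n",
    "      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase\n",
]

for row, below in parse_hits(sample):
    print(below, row)
```

`parse_hits` accepts any iterable of lines, so an open file object or the list returned by readlines() can be passed in directly.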