Generating a regex to find substrings containing specific numbers of character occurrences

Date: 2014-04-02 20:58:21

Tags: python regex

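Suppose I have the string "IICCIICCIICBIICCIICDII". The string has the form II[CBD][CBD]II[CBD][CBD]II... and this pattern repeats. I am trying to find all overlapping substrings that satisfy the following conditions: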

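  1. The substring does not start or end with the letter I. Suggested solution: (?<=[CBD]), (?=[CBD])
  2. The substring contains at least (but as few as possible beyond) specific numbers of occurrences of the letters C, B and D. The letters may appear in any arrangement and may each have a different count. Suggested solution: something along the lines of [C]{m, n}?
  3. These predefined numbers are variable, so I will generate the regex dynamically and need to be able to change them.
  4. The order in which the letters and their counts occur does not matter to me.
  5. Addition to condition 1: the substring must not start or end with two [CBD] characters (e.g. CCIIB or BIICC are invalid matches) (sorry).
  6. Examples: for a pattern with at least 2 Cs: CIIC (three of them); with 2 Cs and 1 B: CIICBIIC, BIICCIIC.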

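I think my question is similar to the one referenced in one of the answers. I have looked at that question (titled "Shortest repeating substring"). Mine differs in that the repeating pattern must contain specific numbers of certain characters, whereas the referenced question only looks for the shortest repeating pattern. It was useful, though.

Please let me know if the question is unclear or a duplicate. Thanks.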

1 answer:

Answer 0: (score: 1)

Final analysis

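Addressing your latest comment: as suspected, this cannot be done with a regular expression unless the engine can count. Specifically, the counting must be able to reset its counters when the engine backtracks.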

As far as I know, only one engine can do this, and it is Perl. Unfortunately, that means the task cannot be done with a Python regex.

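I am adding a Perl regex below that does this. It is included only to show the approach. If you want to accomplish the same task without a regex, that can certainly be done.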

Sorry that this is not much help to you. - sln
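Since the question is tagged python, here is a minimal sketch of the non-regex route mentioned above. It is not the answer author's code: the function name `find_overlaps` and the `minimums` mapping are hypothetical, and it assumes every character that is not a counted letter (here `I`) is filler. For each valid start it scans forward, counts letters, and stops at the first valid end where all minimums are met, which is the same "shortest match per start" behaviour the Perl pattern below aims for.

```python
def find_overlaps(s, minimums):
    """Return (offset, substring) for the shortest qualifying substring at each start.

    minimums maps each counted letter to its minimum occurrence count,
    e.g. {"B": 1, "C": 1, "D": 0}. Any other character is treated as filler.
    """
    letters = set(minimums)
    n = len(s)
    results = []
    for i in range(n):
        # Start rule (conditions 1 and 5): a counted letter that is
        # not immediately followed by another counted letter.
        if s[i] not in letters or (i + 1 < n and s[i + 1] in letters):
            continue
        counts = dict.fromkeys(minimums, 0)
        for j in range(i, n):
            if s[j] not in letters:
                continue  # filler can never end a substring
            counts[s[j]] += 1
            # End rule (conditions 1 and 5): a counted letter that is
            # not immediately preceded by another counted letter.
            end_ok = j == 0 or s[j - 1] not in letters
            if end_ok and all(counts[k] >= minimums[k] for k in minimums):
                results.append((i, s[i:j + 1]))
                break  # keep only the shortest substring for this start
    return results


text = "IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII"
for offset, sub in find_overlaps(text, {"B": 1, "C": 1, "D": 0}):
    print(offset, sub)
```

Because the minimums are passed in as a plain mapping, changing the required counts (condition 3) needs no pattern regeneration at all.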

 # (?{ $vb=0; $vc=0; $vd=0; })(?=(?![BCD]{2})(?![I])((?:(?:[B][I]*?)(?{ local $vb = $vb+1 })|(?:[C][I]*?)(?{ local $vc = $vc+1 })|(?:[D][I]*?)(?{ local $vd = $vd+1 }))+?)(?(?{$vb >= 2 && $vc >= 5 && $vd >= 2})(?{ $VB=$vb; $VC=$vc; $VD=$vd; })|(?!))(?<![I])(?<![BCD]{2}))
 #

 (?{ $vb=0; $vc=0; $vd=0; })         # Initialize local counters to zero
 (?=
      (?! [BCD]{2} )                      # App Condition 5a, not start with 2 occurrences of BCD
      (?! [I] )                           # App Condition 1a, not start with I
      (                                   # (1 start)
           (?:                                 # Cluster group start (App Conditions 2-4)
                (?: [B] [I]*? )                     # 'B'
                (?{ local $vb = $vb+1 })            # Increment local 'B' counter
             |  
                (?: [C] [I]*? )                     # 'C'
                (?{ local $vc = $vc+1 })            # Increment local 'C' counter
             |  
                (?: [D] [I]*? )                     # 'D'
                (?{ local $vd = $vd+1 })            # Increment local 'D' counter
           )+?                                 # Cluster group end, do the minimum
                                               # to satisfy conditions
      )                                   # (1 end)

      (?(?{
           # Code conditional - the local counters
           # must be greater than or equal to these values
           $vb >= 2 && $vc >= 5 && $vd >= 2
        })
           # Yes condition, copy local counters to global vars.
           (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
        |  
           # No condition, fail the expression here
           # force engine to backtrack (and reset local counters) 
           (?!)
      )
      (?<! [I] )                          # App Condition 1b, not end with I
      (?<! [BCD]{2} )                     # App Condition 5b, not end with 2 occurrences of BCD
 )

Perl test case

 $str = "IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII";
 print  "\n";
 print  "01234567890123456789012345678901234567890123456789012345678901\n";
 print  "          1         2         3         4         5         6\n";
 print  $str,"\n-------------------------------------------------------\n";

 FindOverlaps(2,5,2);
 FindOverlaps(1,2,0);
 FindOverlaps(1,1,0);
 FindOverlaps(1,1,1);
 FindOverlaps(0,1,1);
 FindOverlaps(1,0,1);

 sub FindOverlaps
 {
     ($MinB, $MinC, $MinD) = @_;

     print "\nB=$MinB, C=$MinC, D=$MinD\n";

     while ( $str =~ /

          (?{ $vb=0; $vc=0; $vd=0; })         # Initialize local counters to zero
          (?=
               (?! [BCD]{2} )                      # App Condition 5a, not start with 2 occurrences of BCD
               (?! [I] )                           # App Condition 1a, not start with I
               (                                   # (1 start)
                    (?:                                 # Cluster group start (App Conditions 2-4)
                         (?: [B] [I]*? )                     # 'B'
                         (?{ local $vb = $vb+1 })            # Increment local 'B' counter
                      |  
                         (?: [C] [I]*? )                     # 'C'
                         (?{ local $vc = $vc+1 })            # Increment local 'C' counter
                      |  
                         (?: [D] [I]*? )                     # 'D'
                         (?{ local $vd = $vd+1 })            # Increment local 'D' counter
                    )+?                                 # Cluster group end, do the minimum
                                                        # to satisfy conditions
               )                                   # (1 end)

               (?(?{
                    # Code conditional - the local counters
                    # must be greater than or equal to these values
                    $vb >= $MinB && $vc >= $MinC && $vd >= $MinD
                 })
                    # Yes condition, copy local counters to global vars.
                    (?{ $VB=$vb; $VC=$vc; $VD=$vd; })
                 |  
                    # No condition, fail the expression here
                    # force engine to backtrack (and reset local counters) 
                    (?!)
               )
               (?<! [I] )                          # App Condition 1b, not end with I
               (?<! [BCD]{2} )                     # App Condition 5b, not end with 2 occurrences of BCD
          )
     /xg )
     {
        print sprintf("found:   %-10s %-30s  offset = %s\n", "\($VB,$VC,$VD\)", $1, $-[0]);
     }
 }

Output >>

 01234567890123456789012345678901234567890123456789012345678901
           1         2         3         4         5         6
 IICCIICBIICCIIDCIICCIICDIICCIIBCIICCIICBIICCIIDCIICCIICCIICCII
 -------------------------------------------------------

 B=2, C=5, D=2
 found:   (2,10,2)   CIICBIICCIIDCIICCIICDIICCIIB    offset = 3
 found:   (2,8,2)    BIICCIIDCIICCIICDIICCIIB        offset = 7
 found:   (2,12,2)   CIIDCIICCIICDIICCIIBCIICCIICBIIC  offset = 11
 found:   (2,12,2)   CIICCIICDIICCIIBCIICCIICBIICCIID  offset = 15
 found:   (2,10,2)   CIICDIICCIIBCIICCIICBIICCIID    offset = 19
 found:   (2,8,2)    DIICCIIBCIICCIICBIICCIID        offset = 23

 B=1, C=2, D=0
 found:   (1,3,0)    CIICBIIC                        offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (1,3,0)    CIIBCIIC                        offset = 27
 found:   (1,5,0)    CIICCIICBIIC                    offset = 31
 found:   (1,3,0)    CIICBIIC                        offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

 B=1, C=1, D=0
 found:   (1,3,0)    CIICBIIC                        offset = 3
 found:   (1,1,0)    BIIC                            offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (1,1,0)    CIIB                            offset = 27
 found:   (1,5,0)    CIICCIICBIIC                    offset = 31
 found:   (1,3,0)    CIICBIIC                        offset = 35
 found:   (1,1,0)    BIIC                            offset = 39

 B=1, C=1, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

 B=0, C=1, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (0,1,1)    CIID                            offset = 11
 found:   (0,5,1)    CIICCIICDIIC                    offset = 15
 found:   (0,3,1)    CIICDIIC                        offset = 19
 found:   (0,1,1)    DIIC                            offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39
 found:   (0,1,1)    CIID                            offset = 43

 B=1, C=0, D=1
 found:   (1,4,1)    CIICBIICCIID                    offset = 3
 found:   (1,2,1)    BIICCIID                        offset = 7
 found:   (1,7,2)    CIIDCIICCIICDIICCIIB            offset = 11
 found:   (1,6,1)    CIICCIICDIICCIIB                offset = 15
 found:   (1,4,1)    CIICDIICCIIB                    offset = 19
 found:   (1,2,1)    DIICCIIB                        offset = 23
 found:   (2,7,1)    CIIBCIICCIICBIICCIID            offset = 27
 found:   (1,6,1)    CIICCIICBIICCIID                offset = 31
 found:   (1,4,1)    CIICBIICCIID                    offset = 35
 found:   (1,2,1)    BIICCIID                        offset = 39

(Old)

I think this is the best you can do with a regular expression.

Edit - revised for the new condition 5.

 #  String:
 #  (?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){1}(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}|(?:[C][IDB]*?){2}(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}|(?:[D][IBC]*?){0}(?:[B][IDC]*?){1}(?:[C][IDB]*?){2}|(?:[C][IDB]*?){2}(?:[B][IDC]*?){1}(?:[D][IBC]*?){0})(?<![I])(?<![BCD]{2}))

 # Example: Finds 1-B, 2-C's     
 (?=
      (?! [BCD]{2} )              # Condition 5a, not start with 2 occurrences of BCD
      (?! [I] )                   # Condition 1a, not start with I (not really necessary here)

      (                           # (1 start), Conditions 2-4
           (?: [B] [IDC]*? ){1}
           (?: [C] [IDB]*? ){2}
           (?: [D] [IBC]*? ){0}
        |  
           (?: [C] [IDB]*? ){2}
           (?: [D] [IBC]*? ){0}
           (?: [B] [IDC]*? ){1}
        |  
           (?: [D] [IBC]*? ){0}
           (?: [B] [IDC]*? ){1}
           (?: [C] [IDB]*? ){2}
        |  
           (?: [C] [IDB]*? ){2}
           (?: [B] [IDC]*? ){1}
           (?: [D] [IBC]*? ){0}
      )                           # (1 end)

      (?<! [I] )                  # Condition 1b, not end with I
      (?<! [BCD]{2} )             # Condition 5b, not end with 2 occurrences of BCD
 )

Perl test case

  $str = "IICCIICCIICBIICCIICDIIDIICCIIB";

  print  "\n";
  print  "012345678911234567892123456789\n";
  print  "          +         +         \n";
  print  $str,"\n------------------------------\n";

  ($B,$C,$D) = (1,2,0);
  FindOverlaps();

  ($B,$C,$D) = (1,1,0);
  FindOverlaps();

  ($B,$C,$D) = (1,1,1);
  FindOverlaps();

  ($B,$C,$D) = (0,1,1);
  FindOverlaps();

  ($B,$C,$D) = (1,0,1);
  FindOverlaps();

  sub FindOverlaps
  {
      print "\nB=$B, C=$C, D=$D\n";

      while ( $str =~ /(?=(?![BCD]{2})(?![I])((?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}|(?:[C][IDB]*?){$C}(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}|(?:[D][IBC]*?){$D}(?:[B][IDC]*?){$B}(?:[C][IDB]*?){$C}|(?:[C][IDB]*?){$C}(?:[B][IDC]*?){$B}(?:[D][IBC]*?){$D})(?<![I])(?<![BCD]{2}))/g )
      {
          print "found:  '$1' \t offset = $-[0]\n";
      }
  }

Output >>

 012345678911234567892123456789
           +         +
 IICCIICCIICBIICCIICDIIDIICCIIB
 ------------------------------

 B=1, C=2, D=0
 found:  'CIICBIIC'       offset = 7
 found:  'BIICCIIC'       offset = 11

 B=1, C=1, D=0
 found:  'BIIC'   offset = 11
 found:  'CIIB'   offset = 26

 B=1, C=1, D=1
 found:  'BIICCIICDIID'   offset = 11

 B=0, C=1, D=1
 found:  'DIIC'   offset = 22

 B=1, C=0, D=1
 found:  'BIICCIICDIID'   offset = 11
 found:  'DIICCIIB'       offset = 22
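
Condition 3 in the question asks for the regex to be generated dynamically. The following is a rough sketch of how the (old) lookahead pattern could be assembled in Python, assuming the standard `re` module, whose fixed-width lookbehinds are enough for this pattern. The helper names `build_pattern` and `find_overlaps_regex` are hypothetical. Unlike the hand-written pattern above, which lists four orderings of the letter groups, this sketch enumerates every ordering with `itertools.permutations`.

```python
import re
from itertools import permutations


def build_pattern(counts, filler="I"):
    """Build the overlapping-lookahead pattern for the given letter counts.

    counts maps each letter to the repeat count of its group,
    e.g. {"B": 1, "C": 2, "D": 0}.
    """
    letters = "".join(counts)              # e.g. "BCD"
    letter_class = "[" + letters + "]"

    def unit(letter):
        # One counted letter, lazily followed by filler or any other letter.
        others = filler + letters.replace(letter, "")
        return "(?:" + letter + "[" + others + "]*?){" + str(counts[letter]) + "}"

    # One alternative per ordering of the letter groups.
    alternatives = "|".join(
        "".join(unit(letter) for letter in order)
        for order in permutations(letters)
    )
    return (
        "(?=(?!" + letter_class + "{2})(?!" + filler + ")"
        + "(" + alternatives + ")"
        + "(?<!" + filler + ")(?<!" + letter_class + "{2}))"
    )


def find_overlaps_regex(s, counts):
    """Return (offset, substring) pairs; the lookahead allows overlaps."""
    pattern = re.compile(build_pattern(counts))
    return [(m.start(), m.group(1)) for m in pattern.finditer(s)]


text = "IICCIICCIICBIICCIICDIIDIICCIIB"
print(find_overlaps_regex(text, {"B": 1, "C": 2, "D": 0}))
```

For B=1, C=2, D=0 this should report CIICBIIC at offset 7 and BIICCIIC at offset 11, the same as the Perl run above. With a nonzero D count the results may differ from the Perl output, because the extra orderings can match where the four hand-written ones do not.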