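Regex matching conflicts with overlapping symbols

Time: 2016-09-21 01:10:08

Tags: java regex token

I am trying to match all tokens that contain the symbols < and >, but there are some conflicts. Specifically, my tokens are <, >, </ and />, plus comments that start with <!-- and end with -->.

My regexes for these are as follows: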

String LTHAN = "<"; 
String GTHAN = ">";
String LTHAN_SLASH = "</";
String GTHAN_SLASH = "/>";
String COMMENT = "<!--.*-->";

I compile them by adding them to a list with a generic method:

public void add(String regex, int token) {
    tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
}

Here is my TokenInfo class:

private class TokenInfo {
    public final Pattern regex;
    public final int token;

    public TokenInfo(Pattern regex, int token) {
        super();
        this.regex = regex;
        this.token = token;
    }
}

I match and display the list as follows:

public void tokenize(String str) {
    String s = new String(str);
    tokens.clear();
    while (!s.equals("")) {
        boolean match = false;

        for (TokenInfo info : tokenInfos) {
            Matcher m = info.regex.matcher(s);
            if (m.find()) {
                match = true;

                String tok = m.group().trim();
                tokens.add(new Token(info.token, tok));

                s = m.replaceFirst("");
                break;
            }
        }
    }
}

Reading the file and displaying the tokens:

    try {
        BufferedReader br;
        String curLine;
        String EOF = null;
        Scanner scan = new Scanner(System.in);
        StringBuilder sb = new StringBuilder();

        try {    
            File dir = new File("C:\\Users\\Me\\Documents\\input files\\example.xml");
            br = new BufferedReader(new FileReader(dir));

            while ((curLine = br.readLine()) != EOF) {
                sb.append(curLine);
                // System.out.println(curLine);
            }
            br.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }

        tokenizer.tokenize(sb.toString());

        for (Tokenizer.Token tok : tokenizer.getTokens()) {
            System.out.println("" + tok.token + " " + tok.sequence);
        }
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }
}

Sample input:

<!-- Sample input file with incomplete recipe -->
<recipe name="bread" prep_time="5 mins" cook_time="3 hours">
   <title>Basic bread</title>
   <ingredient amount="3" unit="cups">Flour</ingredient>
   <instructions>
     <step>Mix all ingredients together.</step>
   </instructions>
</recipe>

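However, the token list that is output splits off < as a separate token, regardless of the characters that follow it, which means it never seems to recognize the tokens </ and />. The same problem occurs with comments. Is this a problem with my regexes? Why doesn't it recognize the patterns </ and />?

I hope my question is clear. I am happy to provide more details or examples if necessary.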

1 Answer:

Answer 0: (score: 1)

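Problems:

  1. Your first regex, ^(<), matches at the start of any input that begins with <, and the whole input string (like every tag and comment in it) begins that way. If it is registered first, as the declaration order suggests, it always wins, and the </ and comment patterns are never reached. So you have to fix it.
  2. I am assuming that an entire tag (without its text content, such as "Basic bread" or "Mix all ingredients together.") is to be treated as one token, so each regex should match a whole tag.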
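To make the first problem concrete, here is a minimal sketch that tries the question's five patterns in declaration order against the start of an input. The class name FirstMatchDemo and the helpers questionOrder and firstMatch are hypothetical names for this illustration, and it assumes the patterns were added in the order they were declared; lookingAt() stands in for the question's ^(...) anchoring.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstMatchDemo {
    // The question's patterns, in declaration order (assumed to be the add() order).
    public static Map<String, Pattern> questionOrder() {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("LTHAN", Pattern.compile("<"));
        patterns.put("GTHAN", Pattern.compile(">"));
        patterns.put("LTHAN_SLASH", Pattern.compile("</"));
        patterns.put("GTHAN_SLASH", Pattern.compile("/>"));
        patterns.put("COMMENT", Pattern.compile("<!--.*-->"));
        return patterns;
    }

    // Returns "NAME:matched text" for the first pattern that matches at the
    // start of the input, or "NONE" if no pattern matches there.
    public static String firstMatch(Map<String, Pattern> patterns, String input) {
        for (Map.Entry<String, Pattern> e : patterns.entrySet()) {
            Matcher m = e.getValue().matcher(input);
            if (m.lookingAt()) {
                return e.getKey() + ":" + m.group();
            }
        }
        return "NONE";
    }

    public static void main(String[] args) {
        // Both inputs start with '<', so LTHAN matches before the longer patterns are tried.
        System.out.println(firstMatch(questionOrder(), "</recipe>"));     // LTHAN:<
        System.out.println(firstMatch(questionOrder(), "<!-- note -->")); // LTHAN:<
    }
}
```

With first-match-wins tokenizing, a short pattern that is a prefix of a longer one has to be tried after the longer one, or it hides it.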
Solution:

Try changing the regexes to the following:

  1. For a single tag: <[^>]*>
  2. For a single closing tag: </[^>]*>
  3. For comments: <!--.*--> (this is already correct)
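As a quick check of how these three patterns behave, the sketch below applies each one to the start of a few sample strings. The class name PatternCheck, the helper leading, and the reluctant variant <!--.*?--> are assumptions added for illustration and are not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternCheck {
    public static final Pattern TAG = Pattern.compile("<[^>]*>");
    public static final Pattern CLOSING_TAG = Pattern.compile("</[^>]*>");
    public static final Pattern COMMENT = Pattern.compile("<!--.*-->");
    // Assumption: a reluctant quantifier, so the match stops at the first "-->".
    public static final Pattern COMMENT_RELUCTANT = Pattern.compile("<!--.*?-->");

    // Returns the text the pattern matches at the start of the input, or null.
    public static String leading(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The general tag pattern also matches closing tags and comments.
        System.out.println(leading(TAG, "</title>"));            // </title>
        System.out.println(leading(TAG, "<!-- a -->"));          // <!-- a -->
        System.out.println(leading(CLOSING_TAG, "<title>"));     // null

        // Greedy .* runs to the last "-->" on the line; the reluctant form stops at the first.
        String twoComments = "<!-- a --><b><!-- c -->";
        System.out.println(leading(COMMENT, twoComments));           // <!-- a --><b><!-- c -->
        System.out.println(leading(COMMENT_RELUCTANT, twoComments)); // <!-- a -->
    }
}
```

Two things follow from this. The patterns overlap, so the order in which they are tried decides which token type a piece of text receives. And the greedy comment pattern is only safe while a line holds a single comment, which matters when the file is read into one string without line breaks.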
Sample program:

      import java.io.BufferedReader;
      import java.io.File;
      import java.io.FileReader;
      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;
      
      public class RegexTest {
          private static ArrayList<TokenInfo> tokenInfoList = new ArrayList<>();
          private static ArrayList<String> tokensList = new ArrayList<>();
      
          public static void add(String regex, int token) {
              tokenInfoList.add(new TokenInfo(Pattern.compile(regex), token));
          }
      
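          static {
              String LTHAN = "<[^>]*>";
              String LTHAN_SLASH = "</[^>]*>";
              String COMMENT = "<!--.*-->";
              // Patterns are tried in the order they are added. LTHAN also matches
              // closing tags and comments, so with this order it claims every tag.
              add(LTHAN, 1);
              add(LTHAN_SLASH, 3);
              add(COMMENT, 5);
          }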
      
          private static class TokenInfo {
              public final Pattern regex;
              public final int token;
      
              public TokenInfo(Pattern regex, int token) {
                  super();
                  this.regex = regex;
                  this.token = token;
              }
          }
      
          public static void tokenize(String str) {
              String s = str;
              while (!s.isEmpty()) {
                  boolean match = false;
                  for (TokenInfo info : tokenInfoList) {
                      Matcher m = info.regex.matcher(s);
                      if (m.find()) {
                          match = true;
                          String tok = m.group().trim();
                          tokensList.add(tok);
                          s = m.replaceFirst("");
                          break;
                      }
                  }
                  // Assumes text nodes are not tokens: once no pattern matches,
                  // only text is left, so stop.
                  if (!match) {
                      break;
                  }
              }
          }
      
          public static void main(String[] args) {
              try {
                  BufferedReader br;
                  String curLine;
                  String EOF = null;
                  StringBuilder sb = new StringBuilder();
      
                  try {
                      File dir = new File("/home/itachi/Desktop/recipe.xml");
                      br = new BufferedReader(new FileReader(dir));
      
                      while ((curLine = br.readLine()) != EOF) {
                          sb.append(curLine);
                          // System.out.println(curLine);
                      }
                      br.close();
                  } catch (IOException e) {
                      System.out.println(e.getMessage());
                  }
      
                  tokenize(sb.toString());
      
                  for (String eachToken : tokensList) {
                     System.out.println(eachToken);
                  }
              } catch (Exception e) {
                  System.out.println(e.getMessage());
              }
          }
      }
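
If the token type of each match also matters, one possible variation is sketched below. It is not the answer's code: the class name OrderedTokenizer and its constants are hypothetical, the ids 1, 3 and 5 are reused from the add() calls above, and the reluctant comment pattern with Pattern.DOTALL is an assumption so that a comment may span line breaks. It tries the most specific pattern first, anchors each attempt at the current position with region() and lookingAt(), and skips text content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OrderedTokenizer {
    // Token ids reused from the sample program's add() calls.
    static final int OPEN_TAG = 1, CLOSE_TAG = 3, COMMENT = 5;

    // Most specific pattern first, so the general tag pattern cannot hide the others.
    private static final int[] TYPES = {COMMENT, CLOSE_TAG, OPEN_TAG};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("<!--.*?-->", Pattern.DOTALL),
        Pattern.compile("</[^>]*>"),
        Pattern.compile("<[^>]*>")
    };

    // Returns tokens as "type text" strings, in document order.
    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
                if (m.lookingAt()) { // must match exactly at pos
                    out.add(TYPES[i] + " " + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // Text content (or a stray '<'): skip ahead to the next '<', or to the end.
                int next = input.indexOf('<', pos + 1);
                pos = (next < 0) ? input.length() : next;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("<!-- note --><title>Basic bread</title>")) {
            System.out.println(token);
        }
        // Prints:
        // 5 <!-- note -->
        // 1 <title>
        // 3 </title>
    }
}
```

Advancing an index instead of calling replaceFirst("") avoids rebuilding the string on every token and keeps the tokens in document order even when the patterns are reordered. In this sketch a self-closing tag such as <br/> is reported as an opening tag, since no separate pattern is defined for it.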
      

Reference:

http://www.regular-expressions.info/ is an excellent resource for learning regular expressions.