Question

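I am doing string pattern matching in Java. My problem is that when I try to match a pattern, CPU usage goes high and the program makes no progress. I have 100 strings, and each one needs to be checked against 2 patterns.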

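Below is the sample code I am using. When it reaches the second pattern, patternMatch[1], it hangs on the first string in patternList and the CPU sits at 100%. How can I do this better?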

String[] patternMatch = {
        "([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)",
        "([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)+([+\\-/*])+([\\w\\s]+)"
};
List<String> patternList = new ArrayList<String>();

patternList.add("Avg Volume Units product A + Volume Units product A");
patternList.add("Avg Volume Units /  Volume Units product A");
patternList.add("Avg retailer On Hand / Volume Units Plan / Store Count");
patternList.add("Avg Hand Volume Units Plan Store Count");
patternList.add("1 - Avg merchant Volume Units");
patternList.add("Total retailer shipment Count");

for (String s : patternList) {
    for (int i = 0; i < patternMatch.length; i++) {
        Pattern pattern = Pattern.compile(patternMatch[i]);
        Matcher matcher = pattern.matcher(s);
        System.out.println(s);
        if (matcher.matches()) {
            System.out.println("Passed");
        } else {
            System.out.println("Failed;");
        }
    }
}

Answer 1

It looks like you are facing a variant of catastrophic backtracking caused by ([\\w\\s]+)+. Try using ([\\w\\s]+) instead:

String[] patternMatch = {
        "([\\w\\s]+)([+\\-/*])+([\\w\\s]+)",
        "([\\w\\s]+)([+\\-/*])+([\\w\\s]+)([+\\-/*])+([\\w\\s]+)"
};
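
As a rough check of the patterns above, here is a minimal sketch that runs them against some of the strings from the question. The class name FixedPatternDemo and the helper passes are made up for illustration; the patterns are compiled once rather than inside the loop, which is a small extra change and not part of the fix itself.

```java
import java.util.regex.Pattern;

class FixedPatternDemo {
    // Patterns from this answer: the outer "+" after ([\w\s]+) is gone.
    static final Pattern ONE_OP =
            Pattern.compile("([\\w\\s]+)([+\\-/*])+([\\w\\s]+)");
    static final Pattern TWO_OPS =
            Pattern.compile("([\\w\\s]+)([+\\-/*])+([\\w\\s]+)([+\\-/*])+([\\w\\s]+)");

    // Hypothetical helper: true if the whole string matches either pattern.
    static boolean passes(String s) {
        return ONE_OP.matcher(s).matches() || TWO_OPS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Avg Volume Units product A + Volume Units product A",
            "Avg retailer On Hand / Volume Units Plan / Store Count",
            "Avg Hand Volume Units Plan Store Count"
        };
        for (String s : inputs) {
            System.out.println(s + " -> " + (passes(s) ? "Passed" : "Failed"));
        }
    }
}
```

With the nested quantifier removed, a string that does not match is rejected after a bounded amount of backtracking instead of an exponential number of attempts.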

Answer 2

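@pshemo is probably right about the catastrophic backtracking. However, I would suggest a completely different approach: use String.split() with zero-width lookarounds (a lookbehind and a lookahead) so that the string is split just before and just after each operator (+-*/).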

String[] x = s.split("((?<=[\\-\\+\\*/])|(?=[\\-\\+\\*/]))");
if (x.length == 3 || x.length== 5)
    System.out.println("Passed");
else
    System.out.println("Failed");

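split returns an array with the operators at the odd offsets (1, 3) and the strings between the operators at the even offsets (0, 2 and 4). This should be faster than a regex that backtracks.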
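
To make the array layout concrete, here is a small sketch of the split approach. The class name SplitDemo and the helpers tokens and passes are invented for this example, and the character class is written as [-+*/], which should be equivalent to the escaped form used above.

```java
import java.util.Arrays;

class SplitDemo {
    // Zero-width split points: right after an operator (lookbehind)
    // or right before one (lookahead).
    static final String OPERATOR_BOUNDARY = "(?<=[-+*/])|(?=[-+*/])";

    // Hypothetical helper: operands land at even offsets, operators at odd ones.
    static String[] tokens(String s) {
        return s.split(OPERATOR_BOUNDARY);
    }

    // Hypothetical helper: 3 tokens means one operator, 5 tokens means two.
    static boolean passes(String s) {
        int n = tokens(s).length;
        return n == 3 || n == 5;
    }

    public static void main(String[] args) {
        String s = "Avg retailer On Hand / Volume Units Plan / Store Count";
        System.out.println(Arrays.toString(tokens(s)));
        System.out.println(passes(s) ? "Passed" : "Failed");
    }
}
```

Note that this only counts tokens. Unlike the regex, it does not check which characters the operands contain, and two adjacent operators such as "a++b" produce four tokens and therefore fail, so the two approaches are not strictly interchangeable.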

Answer 3

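I see no need to quantify a single group that is already quantified. For example, (?:(?:X)+)* matches the same thing as X*.

Quantifying a quantified single group this way is what causes the exponential backtracking. If you want to keep that model, (?:(?:X))* would be better; by itself it will not cause catastrophic backtracking.

Another point is that you should try to avoid putting a group around a single, complete construct.

In your example, the character classes are examples of such single (base) constructs.

Also, if you can, use cluster (non-capturing) groups (?:...) instead of capture groups (...). A construct like ([+\-/*])+ will match one or more of the characters in that class, but it will only capture the last one. So the capture group there serves no real purpose, either as a grouping or as a capture.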
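
The point about the capture group can be checked with a short sketch. The class name CaptureDemo, the helper firstGroup and the input "a+-b" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Hypothetical helper: group 1 of a full match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Quantifier outside the group: each repetition overwrites group 1,
        // so only the last operator character is kept.
        System.out.println(firstGroup("\\w([+\\-/*])+\\w", "a+-b")); // prints "-"

        // Quantifier inside the group: the whole operator run is captured.
        System.out.println(firstGroup("\\w([+\\-/*]+)\\w", "a+-b")); // prints "+-"
    }
}
```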

So, if you follow these rules and keep the capture groups, the new regexes look like this:

 # "([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)"

 ( [\w\s]+ )                   # (1)
 ( [+\-/*]+ )                  # (2)
 ( [\w\s]+ )                   # (3)

and

 # "([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)"

 ( [\w\s]+ )                   # (1)
 ( [+\-/*]+ )                  # (2)
 ( [\w\s]+ )                   # (3)
 ( [+\-/*]+ )                  # (4)
 ( [\w\s]+ )                   # (5)
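
As one possible use of the five capture groups, the sketch below pulls the operands and operators out of a matching string. The class name OperandDemo and the helper parts are hypothetical, and trimming the captured text is an extra assumption about what the caller wants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperandDemo {
    // The two-operator regex from this answer, compiled once.
    static final Pattern TWO_OPS = Pattern.compile(
            "([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)([+\\-/*]+)([\\w\\s]+)");

    // Hypothetical helper: trimmed groups 1..5, or an empty list if there is no match.
    static List<String> parts(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = TWO_OPS.matcher(s);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                out.add(m.group(g).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parts("Avg retailer On Hand / Volume Units Plan / Store Count"));
    }
}
```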

High CPU utilization in regex pattern matching