Question

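I need to go through more than a gigabyte of text and surround punctuation with spaces (tokenizing). I have a long regular expression (1818 characters, though most of that is lists) that defines when punctuation should not be separated. Because it is long and complicated, it is hard to use groups with it, though I won't rule that out as an option, since I could make most of the groups non-capturing (?:).

Question: How can I efficiently replace certain characters that don't match a particular regular expression?

I've looked into using lookaheads or something similar. I haven't quite figured it out, and it seems terribly inefficient anyway, though it would probably still be better than using placeholders. I can't seem to find a good "replace with a bunch of different regular expressions for both finding and replacing, in one pass" function.

Should I do this line by line instead of operating on the whole text?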

String completeRegex = "[^\\w](("+protectedPrefixes+")|(("+protectedNumericOnly+")\\s*\\p{N}))|"+protectedRegex;
Matcher protectedM = Pattern.compile(completeRegex).matcher(s);
ArrayList<String> protectedStrs = new ArrayList<String>();
//Take note of the protected matches.
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
}
//Replace protected matches.
String replaceStr = "<PROTECTED>";
s = protectedM.replaceAll(replaceStr);

//Now that it's safe, separate punctuation.
s = s.replaceAll("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])"," $1 ");

// These are for apostrophes. Can these be combined with either the protecting regular expression or the one above?
s = s.replaceAll("([\\p{N}\\p{L}])'(\\p{L})", "$1 '$2");
s = s.replaceAll("([^\\p{L}])'([^\\p{L}])", "$1 ' $2");

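Note the two additional replacements for apostrophes. Using placeholders protects against those replacements as well, but I'm not really concerned with apostrophes or single quotes in my protecting regex, so it's not a real issue.

I'm rewriting what I considered very inefficient Perl code in Java, keeping track of speed as I go. Everything was fine until I started replacing the placeholders with the original strings. With that addition it is too slow to be reasonable (I have never seen it get even close to finishing).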

//Replace placeholders with original text.
String resultStr = "";
String currentStr = "";
int currentPos = 0;
int[] protectedArray = replaceStr.codePoints().toArray();

int protectedLen = protectedArray.length;
int[] strArray = s.codePoints().toArray();
int protectedCount = 0;
for (int i=0; i<strArray.length; i++) {
    int pt = strArray[i];
//          System.out.println("pt: "+pt+" symbol: "+String.valueOf(Character.toChars(pt)));
    if (protectedArray[currentPos]==pt) {
        if (currentPos == protectedLen - 1) {
            resultStr += protectedStrs.get(protectedCount);
            protectedCount++;
            currentPos = 0;
        } else {
            currentPos++;
        }
    } else {
        if (currentPos > 0) {
            resultStr += replaceStr.substring(0, currentPos);
            currentPos = 0;
            currentStr = "";
        }
        resultStr += ParseUtils.getSymbol(pt);
    }

}
s = resultStr;

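This code is probably not the most efficient way to put the protected matches back. What is a better way? Or better yet, how can I replace punctuation without having to use placeholders at all?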

Answer 1

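At first I thought appendReplacement wasn't what I was looking for, but it turned out that it was. Since replacing the placeholders at the end was what slowed things down, all I really needed was a way to dynamically replace matches: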

StringBuffer replacedBuff = new StringBuffer();
Matcher replaceM = Pattern.compile(replaceStr).matcher(s);
int index = 0;
while (replaceM.find()) {
    replaceM.appendReplacement(replacedBuff, "");
    replacedBuff.append(protectedStrs.get(index));
    index++;
}
replaceM.appendTail(replacedBuff);
s = replacedBuff.toString();

Reference: the second answer at this question.
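For a self-contained illustration of the same idea, here is a minimal sketch. The class name, the restore method, and the sample strings are made up for the example; only Pattern, Matcher, appendReplacement, and appendTail come from java.util.regex. It passes each original through Matcher.quoteReplacement, which is an alternative to appending it separately and keeps any '$' or '\' in the protected text literal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderRestore {
    // Replaces the i-th occurrence of the placeholder with originals.get(i).
    // Assumes there are at least as many originals as placeholders.
    static String restore(String text, String placeholder, List<String> originals) {
        // Pattern.quote treats the placeholder as a literal, not as a regex.
        Matcher m = Pattern.compile(Pattern.quote(placeholder)).matcher(text);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (m.find()) {
            // quoteReplacement stops '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(originals.get(index)));
            index++;
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Call <PROTECTED> about <PROTECTED> .";
        List<String> originals = Arrays.asList("Dr. Smith", "$1.50");
        System.out.println(restore(text, "<PROTECTED>", originals));
    }
}
```

Each find() resumes where the previous match ended, so the whole string is scanned once and the output buffer only grows by appending.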

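Another option to consider: during the first pass through the String, while finding the protected strings, take the start and end indices of each match, replace the punctuation in everything outside the match, append the matched string, and keep going. This removes the need to write a String with placeholders and requires only one pass through the entire String. It does, however, require many separate small replacement operations. (By the way, be sure to compile the patterns before the loop rather than using String.replaceAll().) A similar alternative is to join the unprotected substrings together and then replace them all at once. However, the protected strings would then have to be added back into the replaced string at the end, so I doubt this would save time.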

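// p1, p2 and p3 are the three punctuation patterns, compiled once before the loop.
int currIndex = 0;
while (protectedM.find()) {
    protectedStrs.add(protectedM.group());
    // Separate punctuation only in the unprotected text before this match.
    String substr = s.substring(currIndex, protectedM.start());
    substr = p1.matcher(substr).replaceAll(" $1 ");
    substr = p2.matcher(substr).replaceAll("$1 '$2");
    substr = p3.matcher(substr).replaceAll("$1 ' $2");
    resultStr += substr + protectedM.group();
    currIndex = protectedM.end();
}
// Process the text after the last protected match the same way.
String tail = s.substring(currIndex);
tail = p1.matcher(tail).replaceAll(" $1 ");
tail = p2.matcher(tail).replaceAll("$1 '$2");
tail = p3.matcher(tail).replaceAll("$1 ' $2");
resultStr += tail;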
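The loop above assumes p1, p2, and p3 already exist. A minimal sketch of how they might be set up, assuming they correspond to the three replaceAll calls in the question (the class name and the separate helper are hypothetical):

```java
import java.util.regex.Pattern;

public class PunctuationPatterns {
    // Compiled once and reused; String.replaceAll would recompile on every call.
    static final Pattern P1 = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");
    static final Pattern P2 = Pattern.compile("([\\p{N}\\p{L}])'(\\p{L})");
    static final Pattern P3 = Pattern.compile("([^\\p{L}])'([^\\p{L}])");

    // Applies the three replacements, in the same order as the question, to one segment.
    static String separate(String segment) {
        String out = P1.matcher(segment).replaceAll(" $1 ");
        out = P2.matcher(out).replaceAll("$1 '$2");
        return P3.matcher(out).replaceAll("$1 ' $2");
    }

    public static void main(String[] args) {
        System.out.println(separate("don't,ok"));
        System.out.println(separate("5'5"));
    }
}
```

Note that the first pattern also matches whitespace, exactly as in the question, so existing spaces end up padded as well.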

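Speed comparison for 100,000 lines of text:

Original Perl script: 272.960579875 seconds
My first attempt: too long.
Using appendReplacement(): 14.245160866 seconds
Replacing as protected matches are found: 68.691842962 seconds

Thank you, Java, for not letting me down.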
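A further variant, not among those timed above, folds both steps into one regex: put the protecting expression and the punctuation class into a single alternation, and decide per match whether to copy the text through or pad it with spaces. The sketch below only illustrates the shape of that idea. PROTECTED is a hypothetical stand-in for the real 1818-character expression, the class and method names are invented, and the apostrophe rules are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePassTokenizer {
    // Hypothetical stand-in for the real protecting regex.
    static final String PROTECTED = "(?:Dr|Mr)\\.|\\p{N}+\\.\\p{N}+";

    // Group 1: protected text, copied as-is. Group 2: one character to pad with spaces.
    static final Pattern TOKEN = Pattern.compile(
            "(" + PROTECTED + ")|([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");

    static String tokenize(String s) {
        Matcher m = TOKEN.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) != null)
                    ? m.group(1)                 // protected: leave untouched
                    : " " + m.group(2) + " ";    // punctuation: surround with spaces
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Dr.Smith,3.50!"));
    }
}
```

Because the protected alternative is tried first at every position, punctuation inside a protected match is consumed by group 1 and never reaches group 2. Whether this behaves the same as the placeholder approach depends on the real protecting regex, in particular on any branch that itself begins by consuming a punctuation character.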

Answer 2

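I don't know exactly how big your in-between strings are, but I suspect you can do somewhat better than Matcher.replaceAll, speed-wise.

You're making 3 passes over the string, each time creating a new Matcher instance and then a new String. And because you're using + to concatenate the strings, you create one new string that is the concatenation of the in-between string and the protected group, and then another when you concatenate that onto the current result. You don't really need all of these extra instances.

First, you should accumulate resultStr in a StringBuilder rather than through direct string concatenation. Then you can proceed with something like: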

StringBuilder resultStr = new StringBuilder();
int currIndex = 0;
while (protectedM.find()) {
  protectedStrs.add(protectedM.group());
  // Process the unprotected text between the previous match and this one.
  appendInBetween(resultStr, str, currIndex, protectedM.start());
  resultStr.append(protectedM.group());
  currIndex = protectedM.end();
}
// Process the remaining text after the last protected match.
appendInBetween(resultStr, str, currIndex, str.length());

where appendInBetween is a method implementing the equivalent of the replacements, in a single pass:

void appendInBetween(StringBuilder resultStr, String s, int start, int end) {
  // Pass the whole input string and the bounds, rather than taking a substring.

  // Allocate roughly enough space up-front.
  resultStr.ensureCapacity(resultStr.length() + end - start);

  for (int i = start; i < end; ++i) {
    char c = s.charAt(i);

    // Check if c matches "([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])".
    if (!(Character.isLetter(c)
          || Character.isDigit(c)
          || Character.getType(c) == Character.NON_SPACING_MARK
          // No backslash here: "\\-" in the regex is just an escaped '-'.
          || "_-<>'".indexOf(c) != -1)) {
      resultStr.append(' ');
      resultStr.append(c);
      resultStr.append(' ');
    } else if (c == '\'' && i > 0 && i + 1 < s.length()) {
      // We have a quote that's not at the beginning or end.
      // Call these 3 characters bcd, where c is the quote.

      char b = s.charAt(i - 1);
      char d = s.charAt(i + 1);

      if ((Character.isDigit(b) || Character.isLetter(b)) && Character.isLetter(d)) {
        // If the 3 chars match "([\\p{N}\\p{L}])'(\\p{L})"
        resultStr.append(' ');
        resultStr.append(c);
      } else if (!Character.isLetter(b) && !Character.isLetter(d)) {
        // If the 3 chars match "([^\\p{L}])'([^\\p{L}])"
        resultStr.append(' ');
        resultStr.append(c);
        resultStr.append(' ');
      } else {
        resultStr.append(c);
      }
    } else {
      // Everything else, just append.
      resultStr.append(c);
    }
  }
}

Ideone demo

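Obviously, there is a maintenance cost associated with this code: it is undeniably more verbose. But the advantage of doing it explicitly (aside from the fact that it is a single pass) is that you can debug the code like any other, rather than it being the black box that regexes are.

I'd be interested to know whether this works any faster for you!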
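One rough way to check is to time both versions on the same input. The sketch below is only an assumption about how such a comparison might look: it covers just the first replacement, the class and method names are invented, the input is synthetic, and a single System.nanoTime measurement without warm-up gives an indication rather than a benchmark.

```java
import java.util.regex.Pattern;

public class TimingSketch {
    static final Pattern PUNCT = Pattern.compile("([^\\p{L}\\p{N}\\p{Mn}_\\-<>'])");

    // Regex version of the first replacement from the question.
    static String viaRegex(String s) {
        return PUNCT.matcher(s).replaceAll(" $1 ");
    }

    // Hand-written version of the same replacement.
    // Caveats: Character.isDigit covers only decimal digits while \p{N} covers all
    // number categories, and this loop works on chars, not code points, so the two
    // versions can differ on such input.
    static String viaLoop(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean keep = Character.isLetter(c)
                    || Character.isDigit(c)
                    || Character.getType(c) == Character.NON_SPACING_MARK
                    || "_-<>'".indexOf(c) != -1;
            if (keep) {
                sb.append(c);
            } else {
                sb.append(' ').append(c).append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            input.append("Hello,world!It's_a-test.");
        }
        String s = input.toString();

        long t0 = System.nanoTime();
        String a = viaRegex(s);
        long t1 = System.nanoTime();
        String b = viaLoop(s);
        long t2 = System.nanoTime();

        System.out.println("same output: " + a.equals(b));
        System.out.println("regex ms: " + (t1 - t0) / 1_000_000);
        System.out.println("loop  ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Checking that both outputs are equal before comparing times guards against a faster version that is simply doing less work.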
