Java string tokenization: need to modify a regex pattern to remove short tokens

Time: 2016-01-12 20:51:31

Tags: java regex

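I need to tokenize a String as follows:

  1. Split by whitespace
  2. Remove all non-letters
  3. Remove all letter tokens of length less than N

It looks like I can get #1 and #2 via:
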
    String str = "blah blah";
    String p1 = "[^a-zA-Z ]";
    String p2 = "\\s+";
    String[] tokens = str.replaceAll(p1, "").split(p2);
    

Can I modify p1 to handle #3 as well? As an alternative, I can do:

    String p1 = "[^a-zA-Z ]";
    String p2 = "\\s+";
    String p3 = "\\b\\w{1,2}\\b";
    String[] tokens = str.replaceAll(p1, "").replaceAll(p3, "").split(p2);
    

Is p3 correct?

I would also prefer to avoid adding another pattern (that would be less efficient too, right?)
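On the efficiency point, String.replaceAll and String.split generally compile their regex again on each call, so one way to keep the cost down is to compile the patterns once and do step 3 as a plain length check instead of a third regex. The following is only a sketch under that assumption: the class name PrecompiledTokenizer and the method tokenize are hypothetical, and p1 and p2 are kept exactly as in the question, so it inherits their behaviour on tabs and line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative and do not come from the question.
class PrecompiledTokenizer {
    // p1 and p2 from the question, compiled once and reused across calls.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z ]");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> tokenize(String str, int n) {
        // Same order as the question: strip non-letters first, then split.
        String lettersOnly = NON_LETTERS.matcher(str).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String token : WHITESPACE.split(lettersOnly)) {
            // Step 3 as a length check; for n >= 1 this also drops empty strings.
            if (token.length() >= n) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("blah blah x1 yy zzz", 3)); // prints [blah, blah, zzz]
    }
}
```

Because the length check discards empty strings, the empty first element that split() produces for input with leading whitespace does not reach the result.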

2 answers:

Answer 0: (score: 0)

You can combine #2 and #3 into:

str = str.replaceAll("\\b[^a-zA-Z]*(?:[a-zA-Z][^a-zA-Z ]*){1,2}\\b|[^a-zA-Z ]+", "");

This removes all non-letter/non-space characters and all words shorter than 3 characters.

RegEx Demo
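To see the combined pattern in action, here is a small sketch that applies it to a sentence made of plain ASCII words separated by single spaces. The class name CombinedPatternDemo and the trim()/empty-string handling are assumptions added for the demo; only the pattern is taken from this answer.

```java
import java.util.Arrays;

// Hypothetical demo class; only the pattern itself comes from the answer.
class CombinedPatternDemo {
    static String[] tokenize(String str) {
        String cleaned = str.replaceAll(
                "\\b[^a-zA-Z]*(?:[a-zA-Z][^a-zA-Z ]*){1,2}\\b|[^a-zA-Z ]+", "");
        // trim() is an addition: removing a short first word leaves a leading
        // space, which would make split() return an empty first element.
        cleaned = cleaned.trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("to be or not to be that is the question")));
        // prints [not, that, the, question]
    }
}
```

The {1,2} quantifier is what fixes the minimum length at 3; for a different N the upper bound would have to be built from N - 1.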

Answer 1: (score: 0)

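No, p3 is not correct: you eliminate some whitespace before splitting, you don't account for leading whitespace causing split() to return an empty leading value, and you hard-coded N.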
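The two whitespace problems can be reproduced in isolation. This is a minimal sketch with hypothetical class and method names (WhitespacePitfalls, splitWithLeadingSpace, stripNonLetters); the patterns are p1 and p2 from the question.

```java
import java.util.Arrays;

// Hypothetical demo class; names are illustrative, not from the answer.
class WhitespacePitfalls {
    // A positive-width delimiter match at the very start of the input
    // makes split() return an empty first element.
    static String[] splitWithLeadingSpace() {
        return " Aaa Bbb".split("\\s+");
    }

    // p1 from the question also deletes tabs and line breaks, so tokens that
    // were separated only by such characters are merged before the split.
    static String stripNonLetters(String s) {
        return s.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithLeadingSpace())); // prints [, Aaa, Bbb]
        System.out.println(stripNonLetters("AeA\tAeAeA"));             // prints AeAAeAeA
    }
}
```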

Test using this input string:

String input = " A Aa AaA AaAa \r\n" +
               " 1 11 111 1111 \r\n" +
               " A1 A1a A1c1 A1a1A A1c1A1 A1a1A1a \r\n" +
               " AeA\tAeAeA ";

Below are 4 implementations, with my solution listed last. When called with n = 3, they generate the following output:
[AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA]   ← literalInterpretation
{, AaA, AaAa, AaA, AcA, AaAa, AeAAeAeA}   ← fromQuestion
{, AaA, AaAa, AaA, AcA, AaAa, AeAAeAeA}   ← answerByAnubhava
{AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA}   ← answerByAnubhavaFixedByMe
{AaA, AaAa, AaA, AcA, AaAa, AeA, AeAeA}   ← myAnswer

// Imports assumed by the methods below: java.util.ArrayList, java.util.Arrays, java.util.List
private static void literalInterpretation(int n, String input) {
    // 1. Split by whitespace
    String[] values = input.split("(?U)\\s+"); // Whitespaces (unicode character class)

    // 2. Remove all non-letters
    for (int i = 0; i < values.length; i++)
        values[i] = values[i].replaceAll("\\P{L}+", ""); // Non-letters (unicode category)

    // 3. Remove all letter tokens of length less than N
    List<String> tokens = new ArrayList<>();
    for (String value : values)
        if (value.length() >= n)
            tokens.add(value);

    System.out.println(tokens);
}

private static void fromQuestion(int n, String input) {
    String p1 = "[^a-zA-Z ]";
    String p2 = "\\s+";
    String p3 = "\\b\\w{1," + (n-1) + "}\\b";
    String[] tokens = input.replaceAll(p1, "").replaceAll(p3, "").split(p2);

    System.out.println(Arrays.toString(tokens));
}

private static void answerByAnubhava(int n, String input) {
    String str = input.replaceAll("\\b(?:[a-zA-Z][^a-zA-Z ]*){1," + (n-1) + "}\\b|[^a-zA-Z ]+", "");
    String[] tokens = str.split("\\s+");

    System.out.println(Arrays.toString(tokens));
}

private static void answerByAnubhavaFixedByMe(int n, String input) {
    String[] tokens = input.replaceAll("(?U)\\b[^\\p{L}\\s]*(?:\\p{L}[^\\p{L}\\s]*){1," + (n-1) + "}\\b|[^\\p{L}\\s]+", "")
                           .replaceFirst("(?U)^\\s+", "")
                           .split("(?U)\\s+");

    System.out.println(Arrays.toString(tokens));
}