Question

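I'm trying to split a line of text on whitespace and punctuation. I've managed to do that, but the resulting array now also contains empty strings: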

public class Main {
    public static void main(String[] args) {
        String test = "tim's work 'cool' asdas 'right' three-year-old 123123.";
        // Split on punctuation/whitespace, on ' after whitespace,
        // and on ' before a non-letter.
        String rePattern = "[?,.!\\s]|(?<=\\s)\\'|\\'(?=[^a-zA-Z])";

        String[] arr = test.split(rePattern);

        for (String token : arr) {
            System.out.println(token);
        }
    }
}

For example, the split above prints:

tim's
work

cool

asdas

right

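So it looks like I'm splitting on the punctuation correctly, but the array still contains empty strings. How can I refine my regex so that the split does not produce empty strings?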

Answer 1

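One option is to put an optional ' on either side of the whitespace/sentence-ending character class, so that the ' is consumed by the split, where possible, together with the space or sentence terminator: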

String rePattern = "'?[?,.!\\s]'?";

Output:

tim's
work
was
cool
asdas
right

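Note that there is no need to escape ' in the regex, at least not in Java, where the string delimiter is ". Also, unless you expect whitespace other than plain spaces (e.g. newlines, tabs or the like), you can use a literal space instead of \\s, which is more precise and more concise if you prefer (e.g. String rePattern = "'?[?,.! ]'?";).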
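As a quick check of the pattern above, here is a minimal sketch that runs it against the question's test string. The class name OptionalQuoteSplit and the helper splitWords are made up for illustration; only String.split and the pattern itself come from the answer.

```java
import java.util.Arrays;

class OptionalQuoteSplit {
    // Pattern from the answer: one delimiter character with an optional ' on either side.
    static final String RE_PATTERN = "'?[?,.!\\s]'?";

    // Hypothetical helper, not part of the original answer.
    static String[] splitWords(String line) {
        return line.split(RE_PATTERN);
    }

    public static void main(String[] args) {
        String test = "tim's work 'cool' asdas 'right' three-year-old 123123.";
        System.out.println(Arrays.toString(splitWords(test)));
        // prints [tim's, work, cool, asdas, right, three-year-old, 123123]
    }
}
```

The apostrophe in tim's survives because no delimiter character sits next to it. The final "." does not leave an empty string behind because String.split drops trailing empty strings.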

Answer 2

Because the delimiters occur back to back, you need to merge consecutive matches into a single one.

"(?:[?,.!\\s]|(?<=\\s)'|'(?=[^a-zA-Z]))+"

https://regex101.com/r/BRYxiE/1

 (?:
      [?,.!\s] 
   |  
      (?<= \s )
      '
   |  
      '
      (?= [^a-zA-Z] )
 )+
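To see what the trailing + changes, the sketch below splits the question's string with and without it. The class and constant names are invented for the example; the two patterns are the question's alternation and the merged form shown above.

```java
import java.util.Arrays;

class MergedDelimiterSplit {
    // The question's alternation, wrapped in a non-capturing group.
    static final String SINGLE = "(?:[?,.!\\s]|(?<=\\s)'|'(?=[^a-zA-Z]))";
    // The same group with +, so a run of adjacent delimiters is one match.
    static final String MERGED = SINGLE + "+";

    public static void main(String[] args) {
        String test = "tim's work 'cool' asdas 'right' three-year-old 123123.";
        String[] single = test.split(SINGLE);
        String[] merged = test.split(MERGED);
        // Without +, the space and the quote next to it are two separate
        // delimiters, and the empty string between them becomes a token.
        System.out.println(single.length + " " + Arrays.toString(single));
        System.out.println(merged.length + " " + Arrays.toString(merged));
        // second line prints: 7 [tim's, work, cool, asdas, right, three-year-old, 123123]
    }
}
```

With the single-delimiter pattern the array has 11 entries, four of them empty; with the merged pattern it has 7 and none are empty.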

Matching instead of splitting might actually be better.
It gives you more control.

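Edit:
After a quick look at the edge cases, I determined that
the construct (?<=\s) is a positive requirement and should be replaced with the negative requirement (?<!\S), i.e. a whitespace boundary.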

The reason is that the negative form, the whitespace boundary, also matches at BOS / EOS (beginning / end of string).

The modified regex is

"(?:[?,.!\\s]|(?<!\\S)'|'(?=[^a-zA-Z]))+"

https://regex101.com/r/JGQ6Rw/1

 (?:
      [?,.!\s] 
   |  
      (?<! \S )
      '
   |  
      '
      (?= [^a-zA-Z] )
 )+
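The difference between the two lookbehinds only shows up when a quote sits at the very start of the input. Here is a small sketch using a made-up input string (not from the question); the class and constant names are also just for illustration.

```java
import java.util.Arrays;

class WhitespaceBoundarySplit {
    // Positive lookbehind: the quote must be preceded by a whitespace character.
    static final String LOOKBEHIND = "(?:[?,.!\\s]|(?<=\\s)'|'(?=[^a-zA-Z]))+";
    // Whitespace boundary: the quote must not be preceded by a non-whitespace character.
    static final String BOUNDARY = "(?:[?,.!\\s]|(?<!\\S)'|'(?=[^a-zA-Z]))+";

    public static void main(String[] args) {
        String input = "'hello' world"; // assumed input with a quote at the start
        System.out.println(Arrays.toString(input.split(LOOKBEHIND))); // ['hello, world]
        System.out.println(Arrays.toString(input.split(BOUNDARY)));   // [, hello, world]
    }
}
```

With the boundary version the opening quote is consumed as a delimiter. Because that delimiter sits at index 0, String.split reports a leading empty string, which the caller would still have to skip.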

Answer 3

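Here is a proposed new solution.
Instead of worrying about specific punctuation characters, split on all punctuation
that is not surrounded by [a-z] letters.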

"(?i)(?:(?:\\pP+|\\s)(?<![a-z]\\pP(?=[a-z])))+"

https://regex101.com/r/cNmHF8/1

 (?i)
 (?:
      (?: \pP+ | \s )               # Punct's or whitespace
      (?<!                          # But not under both these conditions
           [a-z] \pP                     # A letter directly before Punct
           (?= [a-z] )                   # and a letter directly after
      )
 )+
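For reference, a minimal sketch of using this pattern with String.split on the question's string. The class name and the splitWords helper are assumptions for the example, not part of the answer.

```java
import java.util.Arrays;

class PunctuationSplit {
    // Pattern from the answer: runs of punctuation or whitespace, except a
    // punctuation character that has a letter directly before and after it.
    static final String RE_PATTERN = "(?i)(?:(?:\\pP+|\\s)(?<![a-z]\\pP(?=[a-z])))+";

    // Hypothetical helper, not part of the original answer.
    static String[] splitWords(String line) {
        return line.split(RE_PATTERN);
    }

    public static void main(String[] args) {
        String test = "tim's work 'cool' asdas 'right' three-year-old 123123.";
        System.out.println(Arrays.toString(splitWords(test)));
        // prints [tim's, work, cool, asdas, right, three-year-old, 123123]
    }
}
```

The apostrophe in tim's and the hyphens in three-year-old stay inside their words because each has a letter on both sides.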

This is still not the proper way to parse words.

Update
What is the proper way to parse words then..? – doctopus

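Well, if it is governed by punctuation alone, then imo the best way
is to identify the internal parts of a word.

That is a beginning character, then the body.
The body may contain punctuation, as long as
each run of punctuation is surrounded by letters.

Done this way, it can't be accomplished with a split function;
the words have to be extracted with a find-all type of function,
where a single capture group holds each word.

This is what imo should be done.

There is one special feature: you can list word-ending punctuation
that stops the match and is treated as the end of the word.
This is needed for characters like ?.!
Add more as needed.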

"[^\\pL\\pN]*([\\pL\\pN](?:[\\pL\\pN_-]|(?![?.!])\\pP(?=[\\pL\\pN\\pP]))*)(?<!\\pP)"

https://regex101.com/r/flUmcB/1

Some explanation

 # Unicode
 # [^\pL\pN]*([\pL\pN](?:[\pL\pN_-]|(?![?.!])\pP(?=[\pL\pN\pP]))*)(?<!\pP)

 [^\pL\pN]*                    # Strip non-letters/numbers               
 (                             # (1 start)
      [\pL\pN]                      # First letter/number
      (?:                           # Word body
           [\pL\pN_-]                    # Letter/number, '_' or '-'
        |                              # or,
           (?! [?.!] )                   # ( Not Special word ending punctuation, Add more here )
           \pP                           # Punctuation
           (?= [\pL\pN\pP] )             #   if followed by punctuation/letter/number
      )*                            # Do many times
 )                             # (1 end)
 (?<! \pP )                    # Don't end on a punctuation
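As a rough sketch of the find-all approach described above, the loop below calls Matcher.find repeatedly and collects capture group 1. The class name WordExtractor and the method extractWords are made up for illustration; the pattern is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Pattern from the answer; group 1 holds the word itself.
    static final Pattern WORD = Pattern.compile(
        "[^\\pL\\pN]*([\\pL\\pN](?:[\\pL\\pN_-]|(?![?.!])\\pP(?=[\\pL\\pN\\pP]))*)(?<!\\pP)");

    // Hypothetical helper: returns every group-1 match in order.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String test = "tim's work 'cool' asdas 'right' three-year-old 123123.";
        System.out.println(extractWords(test));
        // prints [tim's, work, cool, asdas, right, three-year-old, 123123]
    }
}
```

Since the words are matched rather than the delimiters, there is no empty-string case to deal with, whether the input starts or ends with punctuation.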

How to optimize this regex to split a line on whitespace and punctuation (minus apostrophes)
