Java: Pattern matcher returning new lines unexpectedly

Date: 2018-11-02 16:51:20

Tags: java regex pattern-matching

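I have a use case where I must treat any character, escaped or unescaped, as a delimiter for splitting sentences. So far, the unescaped/escaped delimiters we have are:

" " (space), "\\t", "|", "\\|", ";", "\\;", ",", etc.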

This is the regex in use so far:

String delimiter = " ";
String regex = "(?:\\\\.|[^"+ delimiter +"\\\\]++)*";

The input string is:

String input = "234|Tamarind|something interesting ";

Below is the code that splits and prints:

 List<String> matchList = new ArrayList<>(  );
 Pattern pattern = Pattern.compile( regex );
 Matcher regexMatcher = pattern.matcher( input );
 while ( regexMatcher.find() )
 {
     matchList.add( regexMatcher.group() );
 }

 System.out.println( "Unescaped/escaped test result with size: " + matchList.size() );
 matchList.stream().forEach( System.out::println );

However, extra strings (new lines) are stored unexpectedly, so the output looks like this:

Unescaped/escaped test result with size: 5
234|Tamarind|something

interesting

.

Is there a better way to do this, so that no extra strings are produced?

1 answer:

Answer 0 (score: 1)

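This is easy: make sure every match consumes at least one character. Because the outer group is quantified with *, the pattern can also match the empty string, so find() returns an empty match at each delimiter and at the end of the input, and those empty strings are printed as blank lines. To fix it, remove the ++ quantifier and replace * with +.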
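To see where the extra entries come from, here is a small sketch (not part of the original answer) that prints every match together with its start and end offsets, first for the question's pattern and then for the fixed one. The class name EmptyMatchDemo and the helper describeMatches are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyMatchDemo {
    // Hypothetical helper: records each match as "start-end:'text'"
    // so that zero-length matches become visible.
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":'" + m.group() + "'");
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "234|Tamarind|something interesting ";
        // Question's pattern: the outer * allows a match of length zero.
        System.out.println(describeMatches("(?:\\\\.|[^ \\\\]++)*", input));
        // Fixed pattern: + requires at least one character per match.
        System.out.println(describeMatches("(?:\\\\.|[^ \\\\])+", input));
    }
}
```

With the question's pattern this lists five matches, three of them empty (at offsets 22, 34 and 35, i.e. at the two spaces and at the end of the input). With the fixed pattern only the two non-empty tokens remain.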

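Full Java demo (requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern):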

String delimiter = " ";
String regex = "(?:\\\\.|[^"+ delimiter +"\\\\])+";
// System.out.println(regex); // => (?:\\.|[^ \\])+
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
String input = "234|Tamarind|something interesting ";
List<String> matchList = new ArrayList<>(  );
Matcher regexMatcher = pattern.matcher( input );
while ( regexMatcher.find() )
{
    // System.out.println("'"+regexMatcher.group()+"'");
    matchList.add( regexMatcher.group() );
}

System.out.println( "Unescaped/escaped test result with size: " + matchList.size() );
matchList.stream().forEach( System.out::println );

Output:

Unescaped/escaped test result with size: 2
234|Tamarind|something
interesting
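
The question mentions other delimiters such as "|" and ";". Some delimiter characters ("^", "]", "-") have a special meaning inside a character class, so concatenating them into the pattern unchanged may alter or break it. The following is a minimal sketch of one way to handle that, assuming the delimiter is a single character; DelimiterSplitter, tokenPattern and split are hypothetical names, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DelimiterSplitter {
    // Hypothetical helper: builds the answer's pattern for one delimiter.
    // Java accepts a backslash before any non-alphabetic character, so every
    // delimiter that is not a letter or digit is escaped inside the class.
    static Pattern tokenPattern(char delimiter) {
        String d = Character.isLetterOrDigit(delimiter)
                ? String.valueOf(delimiter)
                : "\\" + delimiter;
        return Pattern.compile("(?:\\\\.|[^" + d + "\\\\])+", Pattern.DOTALL);
    }

    // Returns the tokens between unescaped delimiters; escaped
    // delimiters (backslash + delimiter) stay inside their token.
    static List<String> split(String input, char delimiter) {
        List<String> tokens = new ArrayList<>();
        Matcher m = tokenPattern(delimiter).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("234|Tamarind|something interesting ", ' '));
        System.out.println(split("a\\|b|c", '|'));
    }
}
```

The first call yields the two tokens shown in the output above. The second call keeps the escaped pipe inside its token, giving a\|b and c. Note that the backslash is left in the token; removing it would be a separate step.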