Split a string on spaces in Java, treating spaces inside double or single quotes, and spaces preceded by \, as escaped

Time: 2014-12-22 18:14:42

Tags: java regex string

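I am completely new to regular expressions. I am trying to put together an expression that splits the sample string on every space that is not enclosed in single or double quotes and is not preceded by a '\'.

For example: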

He is a "man of his" words\ always

must be split into

He
is 
a 
"man of his"
words\ always

I understand that

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(StringToBeMatched);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}

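splits the sample string on all spaces that are not surrounded by single or double quotes.

How do I incorporate the third condition, ignoring a space if it is preceded by a \?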
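To make the gap concrete, here is a minimal sketch that runs the pattern above against the sample string. The class name QuoteOnlySplitDemo and the tokenize helper are made up for illustration; only the pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself is taken from the question.
class QuoteOnlySplitDemo {
    // Unquoted runs of non-space, non-quote characters, or "..." or '...'.
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java source for the text: He is a "man of his" words\ always
        String input = "He is a \"man of his\" words\\ always";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

With this pattern the quoted part stays together, but the last two tokens come out as words\ and always, because the backslash gets no special treatment.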

3 Answers:

Answer 0 (score: 3)

You can use this regex:

((["']).*?\2|(?:[^\\ ]+\\\s+)+[^\\ ]+|\S+)


In Java:

// Every regex backslash is doubled in the string literal: \2 becomes \\2, \\\s becomes \\\\\\s
Pattern regex = Pattern.compile
   ( "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)" );

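Explanation

This regex works by alternation:

  1. First try (["']).*?\2 to match any quoted string (double or single quotes).
  2. Then try (?:[^\\ ]+\\\s+)+[^\\ ]+ to match any string containing escaped spaces.
  3. Finally use \S+ to match any word without spaces.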
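The three alternatives can be tried end to end with a short program. This is a sketch only: the class name EscapedSpaceSplitDemo and the tokenize helper are hypothetical, and the string literal assumes every regex backslash is doubled for Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the regex from this answer.
class EscapedSpaceSplitDemo {
    // Regex text: ((["']).*?\2|(?:[^\\ ]+\\\s+)+[^\\ ]+|\S+)
    private static final Pattern TOKEN = Pattern.compile(
            "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1)); // group 1 wraps the whole alternation
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java source for the text: He is a "man of his" words\ always
        String input = "He is a \"man of his\" words\\ always";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

For the sample string this prints five tokens, with "man of his" and words\ always each kept as a single token.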

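Answer 1 (score: 2)

Anubhava's solution is nice... I especially like his use of \S+. My solution groups things in a similar way, except that it captures the start and end word boundaries in the third alternative group...

Regex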

(?i)((?:(['|"]).+\2)|(?:\w+\\\s\w+)+|\b(?=\w)\w+\b(?!\w))

For Java

(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))

Example

String subject = "He is a \"man of his\" words\\ always 'and forever'";
Pattern pattern = Pattern.compile( "(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))" );
Matcher matcher = pattern.matcher( subject );
while( matcher.find() ) {
    System.out.println( matcher.group() );  // print the matched token as-is
}

Result

He
is
a
"man of his"
words\ always
'and forever'

Detailed explanation

"(?i)" +                 // Match the remainder of the regex with the options: case insensitive (i)
"(" +                    // Match the regular expression below and capture its match into backreference number 1
                            // Match either the regular expression below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "(" +                    // Match the regular expression below and capture its match into backreference number 2
            "['|\"]" +                // Match a single character present in the list “'|"”
         ")" +
         "." +                    // Match any single character that is not a line break character
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\2" +                   // Match the same text as most recently matched by capturing group number 2
      ")" +
   "|" +                    // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\\\" +                   // Match the character “\” literally
         "\\s" +                   // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      ")+" +                   // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   "|" +                    // Or match regular expression number 3 below (the entire group fails if this one fails to match)
      "\\b" +                   // Assert position at a word boundary
      "(?=" +                  // Assert that the regex below can be matched, starting at this position (positive lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
      "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
         "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      "\\b" +                   // Assert position at a word boundary
      "(?!" +                  // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
")"  

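Answer 2 (score: 0)

The regex for a \ followed by whitespace looks like \\\s, where \\ stands for a literal \ and \s stands for any whitespace character. A Java string holding that regex has to be written as "\\\\\\s", because inside a string literal each \ must itself be escaped by adding another \ in front of it.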
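The two layers of escaping are easy to check in code. In this small sketch the class name BackslashEscapeDemo and its helper are hypothetical; Pattern.pattern() is used only to show the regex text that the engine actually receives.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the string-literal vs. regex escaping layers.
class BackslashEscapeDemo {
    // Six backslashes in source become three in the regex: \\ (literal backslash) + \s
    static final Pattern ESCAPED_WS = Pattern.compile("\\\\\\s");

    static boolean containsEscapedWhitespace(String s) {
        return ESCAPED_WS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED_WS.pattern());                        // prints \\\s
        System.out.println(containsEscapedWhitespace("words\\ always")); // prints true
        System.out.println(containsEscapedWhitespace("words always"));   // prints false
    }
}
```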

So now we want our pattern to find

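  • "..." -> "[^"]*"
  • '...' -> '[^']*'
  • or characters that are not whitespace (\S), plus whitespace that is preceded by \ (\\\s). This one is a bit tricky, because \S can also consume the \ placed before a space, which would prevent \\\s from matching. That is why we want the regex engine to

    • first try \\\s
    • and only afterwards \S

    So we need to write this part of the regex as (\\\s|\S)+ rather than something like (\S|\\\s)+, because the regex engine tries the alternatives separated by OR (|) from left to right. For example, in a regex such as a|ab, the ab alternative is never matched, because a is already consumed by the left-hand alternative.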
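The effect of alternative order can be observed directly. The sketch below uses a hypothetical class AlternationOrderDemo with a firstMatch helper that returns the first match of a regex in an input string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing that alternatives are tried left to right.
class AlternationOrderDemo {
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("a|ab", "ab")); // prints a
        System.out.println(firstMatch("ab|a", "ab")); // prints ab

        // Java source for the text: words\ always
        String input = "words\\ always";
        // \S first: the backslash is consumed as a non-space, so the match stops at the space
        System.out.println(firstMatch("(\\S|\\\\\\s)+", input)); // prints words\
        // \\\s first: backslash plus space is consumed together, so the match continues
        System.out.println(firstMatch("(\\\\\\s|\\S)+", input)); // prints words\ always
    }
}
```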

So your pattern could look like

"[^"]*"|'[^']*'|(\\\s|\S)+
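Combining the three alternatives described in this answer into a single alternation can be tried as follows. This is a sketch assembled from the pieces listed above, not necessarily the exact code of the original answer: the class name CombinedPatternDemo and the tokenize helper are hypothetical, and a non-capturing group is assumed for the last alternative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is assembled from the three parts above.
class CombinedPatternDemo {
    // Regex text: "[^"]*"|'[^']*'|(?:\\\s|\S)+
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|'[^']*'|(?:\\\\\\s|\\S)+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java source for the text: He is a "man of his" words\ always 'and forever'
        String input = "He is a \"man of his\" words\\ always 'and forever'";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

On this input the sketch yields six tokens: He, is, a, "man of his", words\ always and 'and forever'.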