Question

I would like to create a Guava Splitter for Java that treats a quoted Java string as a single block. For example, I would like the following assertion to hold:

@Test
public void testSplitter() {
  String toSplit = "a,b,\"c,d\\\"\",e";
  List<String> expected = ImmutableList.of("a", "b", "c,d\"","e");

  Splitter splitter = Splitter.onPattern(...);
  List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));

  assertEquals(expected, actual);
}

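I can write a regex that finds all the elements and ignores the ',' inside quotes, but I can't find a regex that works as a separator for a Splitter.

If it's not possible, please just say so, and I'll build the list from the find-all regex instead.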
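For reference, the find-all fallback mentioned above could look roughly like the sketch below. It uses only `java.util.regex`, not Guava. The class name `QuotedTokenFinder`, the method `findAll` and the pattern itself are made up for illustration, and it assumes backslash is the only escape character inside quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Guava.
final class QuotedTokenFinder {
    // \G anchors each match where the previous one ended.
    // Group 1: contents of a double-quoted field (backslash escapes allowed).
    // Group 2: an unquoted field.
    // Group 3: the terminator, either a comma or the (empty) end of input.
    private static final Pattern TOKEN = Pattern.compile(
            "\\G(?:\"((?:[^\"\\\\]|\\\\.)*)\"|([^,\"]*))(,|$)");

    static List<String> findAll(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String quoted = m.group(1);
            // For quoted fields, drop the backslash in front of each escaped character.
            tokens.add(quoted != null ? quoted.replaceAll("\\\\(.)", "$1") : m.group(2));
            if (m.group(3).isEmpty()) {
                break; // the terminator was end-of-input, so this was the last token
            }
        }
        return tokens;
    }
}
```

Note that this sketch simply stops at the first malformed field (for example an unterminated quote) instead of reporting an error.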

Answer 1

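This seems like something you should use a CSV library such as opencsv for. Separating values and handling cases like quoted blocks is what they're all about.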

Answer 2

There is a Guava feature request for this: http://code.google.com/p/guava-libraries/issues/detail?id=412

Answer 3

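I have the same problem (except that I don't need to support escaped quote characters). I don't like pulling in another library for something this simple. Then it occurred to me that what I need is a mutable CharMatcher. As with Bart Kiers' solution, it keeps the quote characters.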

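// Assumed to live in a helper class named Splitters (which is what the test below calls),
// with com.google.common.base.CharMatcher and com.google.common.base.Splitter imported.
public static Splitter quotableComma() {
    // The matcher is stateful: it remembers whether the scan is currently inside quotes.
    return Splitter.on(new CharMatcher() {
        private boolean inQuotes = false;

        @Override
        public boolean matches(char c) {
            if ('"' == c) {
                inQuotes = !inQuotes; // every quote character toggles the state
            }
            if (inQuotes) {
                return false; // a comma inside quotes is not a separator
            }
            return (',' == c);
        }
    });
}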

@Test
public void testQuotableComma() throws Exception {
    String toSplit = "a,b,\"c,d\",e";
    List<String> expected = ImmutableList.of("a", "b", "\"c,d\"", "e");
    Splitter splitter = Splitters.quotableComma();
    List<String> actual = ImmutableList.copyOf(splitter.split(toSplit));
    assertEquals(expected, actual);
}

Answer 4

You could split on the following pattern:

\s*,\s*(?=((\\["\\]|[^"\\])*"(\\["\\]|[^"\\])*")*(\\["\\]|[^"\\])*$)

which might look (a bit) friendlier with the (?x) flag:

(?x)            # enable comments, ignore space-literals
\s*,\s*         # match a comma optionally surrounded by space-chars
(?=             # start positive look ahead
  (             #   start group 1
    (           #     start group 2
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 2, and repeat it zero or more times
    "           #     match a quote
    (           #     start group 3
      \\["\\]   #       match an escaped quote or backslash
      |         #       OR
      [^"\\]    #       match any char other than a quote or backslash
    )*          #     end group 3, and repeat it zero or more times
    "           #     match a quote
  )*            #   end group 1, and repeat it zero or more times
  (             #   open group 4
    \\["\\]     #     match an escaped quote or backslash
    |           #     OR
    [^"\\]      #     match any char other than a quote or backslash
  )*            #   end group 4, and repeat it zero or more times
  $             #   match the end-of-input
)               # end positive look ahead

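But even in this commented version, it is still a monster. In plain English, this regex could be explained as follows:

Match a comma that is optionally surrounded by space characters, but only if, looking ahead of that comma (all the way to the end of the string!), there are zero or an even number of quotes, ignoring escaped quotes and escaped backslashes.

So, after seeing this, you might agree with ColinD (I do!) that using some sort of CSV parser is the way to go in this case.

Note that the regex above leaves the quotes around the tokens, i.e. the string a,b,"c,d\"",e (as a literal: "a,b,\"c,d\\\"\",e") will be split as follows: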

a
b
"c,d\""
e
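To try the pattern without Guava, here is a minimal sketch using the JDK's own `Pattern.split`, which takes the same regex syntax as `Splitter.onPattern`. The class name `LookaheadSplitDemo` and its `split` method are only illustrative. The regex is the one-line version from above, escaped as a Java string literal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the look-ahead pattern from this answer.
final class LookaheadSplitDemo {
    // Each regex backslash is doubled and each quote is escaped for the Java literal.
    static final Pattern SEPARATOR = Pattern.compile(
            "\\s*,\\s*(?=((\\\\[\"\\\\]|[^\"\\\\])*\"(\\\\[\"\\\\]|[^\"\\\\])*\")*(\\\\[\"\\\\]|[^\"\\\\])*$)");

    static List<String> split(String input) {
        // Pattern.split drops trailing empty strings, unlike a default Guava Splitter.
        return Arrays.asList(SEPARATOR.split(input));
    }
}
```

Because the look-ahead rescans the rest of the input for every comma, I would expect this to get slow on long lines, which is one more reason to prefer a real CSV parser.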

Answer 5

An improvement on @Rage-Steel's answer.

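// Stateful matcher: matches every character that is outside a quoted section.
final static CharMatcher notQuoted = new CharMatcher() {
    private boolean inQuotes = false;

    @Override
    public boolean matches(char c) {
        if ('"' == c) {
            inQuotes = !inQuotes; // every quote character toggles the state
        }
        return !inQuotes;
    }
};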

final static Splitter SPLITTER = Splitter.on(notQuoted.and(CharMatcher.anyOf(" ,;|"))).trimResults().omitEmptyStrings();

Then,

public static void main(String[] args) {
    final String toSplit = "a=b c=d,kuku=\"e=f|g=h something=other\"";

    List<String> sputnik = SPLITTER.splitToList(toSplit);
    for (String s : sputnik)
        System.out.println(s);
}

Watch out for thread safety (or, to put it simply, there isn't any).
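If the shared mutable state is a concern, one option is to keep the quote flag local to a method instead of inside a CharMatcher. The following is a rough plain-JDK sketch of the same quote-toggling idea, not a Guava Splitter. The names `QuoteAwareSplit` and `split` are made up, and `String.trim()` only approximates what `trimResults()` does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on any of the given separator characters outside quotes.
final class QuoteAwareSplit {
    static List<String> split(String input, String separators) {
        List<String> parts = new ArrayList<>();
        boolean inQuotes = false; // local variable, so nothing is shared between calls or threads
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // every quote character toggles the state
            } else if (!inQuotes && separators.indexOf(c) >= 0) {
                addIfNotEmpty(parts, input.substring(start, i));
                start = i + 1;
            }
        }
        addIfNotEmpty(parts, input.substring(start));
        return parts;
    }

    // Mirrors trimResults().omitEmptyStrings() in spirit.
    private static void addIfNotEmpty(List<String> parts, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }
}
```

Like the matcher above, it keeps the quote characters in the result and does not handle escaped quotes.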

Create a Guava Splitter that supports strings
