Parsing CSV with nested quotes using a regular expression

Time: 2011-10-30 19:37:31

Tags: regex csv

  

Possible duplicates:
  C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas
  regex to parse csv

I know this question has been asked many times, but the answers differ and I am confused.

My row is:

1,3.2,BCD,"qwer 47"" ""dfg""",1

Quoting is optional, and embedded double quotes follow the MS Excel convention. (The data qwer 47" "dfg" is represented as "qwer 47"" ""dfg""".)

I need a regular expression.
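To make the quoting convention concrete, here is a minimal Java sketch of encoding and decoding a single field. The class and method names (`CsvQuoting`, `quote`, `unquote`) are hypothetical and only illustrate the doubled-quote rule; they are not part of the question.

```java
final class CsvQuoting {
    // Wraps a raw value in quotes, doubling every embedded quote.
    static String quote(String raw) {
        return "\"" + raw.replace("\"", "\"\"") + "\"";
    }

    // Reverses quote(): strips the outer quotes and collapses "" to ".
    // Fields without surrounding quotes are returned unchanged.
    static String unquote(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }
}
```

Calling `CsvQuoting.quote` on the data qwer 47" "dfg" should yield the fourth field of the sample row, and `unquote` should turn it back.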

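3 answers:

Answer 0 (score: 5)

OK, you have seen from the comments that a regex is not the right tool for this. But if you insist, here you go:

This regex works in Java (or .NET and other implementations that support possessive quantifiers and verbose regexes):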

^            # Start of string
(?:          # Match the following:
 (?:         #  Either match
  [^",\n]*+  #   0 or more characters except comma, quote or newline
 |           #  or
  "          #   an opening quote
  (?:        #   followed by either
   [^"]*+    #    0 or more non-quote characters
  |          #   or
   ""        #    an escaped quote ("")
  )*         #   any number of times
  "          #   followed by a closing quote
 )           #  End of alternation
 ,           #  Match a comma (separating the CSV columns)
)*           # Do this zero or more times.
(?:          # Then match
 (?:         #  using the same rules as above
  [^",\n]*+  #  an unquoted CSV field
 |           #  or a quoted CSV field
  "(?:[^"]*+|"")*"
 )           #  End of alternation
)            # End of non-capturing group
$            # End of string

Java code:

boolean foundMatch = subjectString.matches(
    "(?x)^         # Start of string\n" +
    "(?:           # Match the following:\n" +
    " (?:          #  Either match\n" +
    "  [^\",\\n]*+ #   0 or more characters except comma, quote or newline\n" +
    " |            #  or\n" +
    "  \"          #   an opening quote\n" +
    "  (?:         #   followed by either\n" +
    "   [^\"]*+    #    0 or more non-quote characters\n" +
    "  |           #   or\n" +
    "   \"\"       #    an escaped quote (\"\")\n" +
    "  )*          #   any number of times\n" +
    "  \"          #   followed by a closing quote\n" +
    " )            #  End of alternation\n" +
    " ,            #  Match a comma (separating the CSV columns)\n" +
    ")*            # Do this zero or more times.\n" +
    "(?:           # Then match\n" +
    " (?:          #  using the same rules as above\n" +
    "  [^\",\\n]*+ #  an unquoted CSV field\n" +
    " |            #  or a quoted CSV field\n" +
    "  \"(?:[^\"]*+|\"\")*\"\n" +
    " )            #  End of alternation\n" +
    ")             # End of non-capturing group\n" +
    "$             # End of string");

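Note that you cannot assume that each line of a CSV file is a complete row. A CSV row may contain newlines, as long as the column containing them is enclosed in quotes. This regex knows that, but it fails if you feed it only a partial row. That is another reason why you really need a CSV parser to validate a CSV file; that is what a parser is for. If you control the input and know that a CSV field will never contain a newline, you might get away with the regex, but only then.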
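As a rough illustration of what a parser does differently, the sketch below walks one complete record character by character and tracks whether it is inside a quoted field. The class name `CsvRecordParser` is hypothetical, and the code is deliberately lenient (a quote in the middle of an unquoted field simply opens a quoted section); for real files an established CSV library is likely the safer choice.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvRecordParser {
    // Parses one complete CSV record, which may span several physical lines.
    // Throws if the record ends inside a quoted field (a partial row).
    static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    current.append(c);          // commas and newlines are data here
                } else if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                    current.append('"');        // "" is an escaped quote
                    i++;
                } else {
                    inQuotes = false;           // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;                // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field separator
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(current.toString());
        return fields;
    }
}
```

For the row from the question, `CsvRecordParser.parse` should return five fields, the fourth being qwer 47" "dfg". A caller reading a file line by line could catch the exception, append the next physical line, and retry.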

Answer 1 (score: 0)

I have used the regexp from this blog post, which deals with the same problem you are trying to solve.

See it here: http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html

In short: ^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
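A minimal sketch of trying this pattern from Java follows; the class name `BlogCsvPattern` and the method `looksLikeCsv` are hypothetical. One thing worth checking before relying on it as a validator: the unquoted alternative `[^,]*` also accepts quote characters, so the pattern appears to accept a row with an unterminated quote as well.

```java
import java.util.regex.Pattern;

final class BlogCsvPattern {
    // The pattern from the answer, escaped for a Java string literal.
    static final Pattern ROW = Pattern.compile(
        "^((\"(?:[^\"]|\"\")*\"|[^,]*)(,(\"(?:[^\"]|\"\")*\"|[^,]*))*)$");

    // Returns true if the whole line matches the pattern.
    static boolean looksLikeCsv(String line) {
        return ROW.matcher(line).matches();
    }
}
```

The row from the question matches, but so does a malformed input such as `"abc` with no closing quote, which suggests treating the pattern as a loose shape check rather than as validation.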

Answer 2 (score: 0)

I haven't done Java for a while, so here is some pseudocode. You can turn it into a function that accepts a string representing a CSV row.

1. Split the row by the "," delimiter into an array of strings. (The method might be called string.split().)
2. Iterate through the array (cells).
    3. If the current string (cell) contains a double quote:
        4. If it doesn't start with a quote - return false; else remove that quote
        5. If it doesn't end with a quote - return false; else remove that quote
        6. Iterate through the remaining characters of the string
            7. If a quote is found, check if the next character is also a quote - if it is not - return false; otherwise skip that next character
        8. End character iteration
    9. End if
10. End array iteration
11. Return true
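The pseudocode above could be sketched in Java roughly as follows. This is one reading of it, not the answerer's own code: the class name `SplitBasedCsvCheck` is hypothetical, the split delimiter is assumed to be a comma, and the second quote of each doubled pair is skipped so that it is not checked again.

```java
final class SplitBasedCsvCheck {
    // Returns true if every cell of the row passes the quote checks.
    static boolean isValidRow(String row) {
        // Naive split on commas; the -1 limit keeps trailing empty cells.
        for (String cell : row.split(",", -1)) {
            if (!cell.contains("\"")) {
                continue; // cells without quotes need no further checks
            }
            // A quoted cell must both start and end with a quote.
            if (cell.length() < 2 || !cell.startsWith("\"") || !cell.endsWith("\"")) {
                return false;
            }
            String inner = cell.substring(1, cell.length() - 1);
            for (int i = 0; i < inner.length(); i++) {
                if (inner.charAt(i) == '"') {
                    // Every remaining quote must be part of a doubled pair.
                    if (i + 1 >= inner.length() || inner.charAt(i + 1) != '"') {
                        return false;
                    }
                    i++; // skip the second quote of the pair
                }
            }
        }
        return true;
    }
}
```

Because the row is split on every comma first, a quoted cell that itself contains a comma, such as `"a,b"`, is cut in two and rejected, so this approach only suits data where quoted fields never contain commas.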