Possible duplicates:
C#, regular expressions : how to parse comma-separated values, where some values might be quoted strings themselves containing commas
regex to parse csv
I know this question has been asked many times, but the answers differ and I'm confused.
My row is:
1,3.2,BCD,"qwer 47"" ""dfg""",1
Quoting is optional, and embedded quotes are doubled, following the MS Excel convention. (The data qwer 47" "dfg" is represented as "qwer 47"" ""dfg""".)
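To make the quoting convention concrete, here is a minimal Java sketch of it. The class and method names (CsvQuoting, quoteField, unquoteField) are made up for illustration and do not come from any library.

```java
final class CsvQuoting {
    /** Wraps a value in quotes and doubles every embedded quote (hypothetical helper). */
    static String quoteField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    /** Reverses quoteField; a field that is not wrapped in quotes is returned unchanged. */
    static String unquoteField(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1).replace("\"\"", "\"");
        }
        return field;
    }

    public static void main(String[] args) {
        String data = "qwer 47\" \"dfg\"";
        System.out.println(quoteField(data));               // "qwer 47"" ""dfg"""
        System.out.println(unquoteField(quoteField(data))); // qwer 47" "dfg"
    }
}
```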
I need a regular expression.
Answer 0 (score: 5)
OK, you have seen from the comments that a regex is not the right tool for this. But if you insist, here goes:
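This regex works in Java (and in other implementations that support possessive quantifiers and verbose regexes; .NET has no possessive quantifiers, so there each X*+ would have to be written as an atomic group (?>X*)):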
^                         # Start of string
(?:                       # Match the following:
  (?:                     # Either match
    [^",\n]*+             # 0 or more characters except comma, quote or newline
  |                       # or
    "                     # an opening quote
    (?:                   # followed by either
      [^"]*+              # 0 or more non-quote characters
    |                     # or
      ""                  # an escaped quote ("")
    )*                    # any number of times
    "                     # followed by a closing quote
  )                       # End of alternation
  ,                       # Match a comma (separating the CSV columns)
)*                        # Do this zero or more times.
(?:                       # Then match
  (?:                     # using the same rules as above
    [^",\n]*+             # an unquoted CSV field
  |                       # or a quoted CSV field
    "(?:[^"]*+|"")*"
  )                       # End of alternation
)                         # End of non-capturing group
$                         # End of string
Java code:
boolean foundMatch = subjectString.matches(
    "(?x)^ # Start of string\n" +
    "(?: # Match the following:\n" +
    " (?: # Either match\n" +
    " [^\",\\n]*+ # 0 or more characters except comma, quote or newline\n" +
    " | # or\n" +
    " \" # an opening quote\n" +
    " (?: # followed by either\n" +
    " [^\"]*+ # 0 or more non-quote characters\n" +
    " | # or\n" +
    " \"\" # an escaped quote (\"\")\n" +
    " )* # any number of times\n" +
    " \" # followed by a closing quote\n" +
    " ) # End of alternation\n" +
    " , # Match a comma (separating the CSV columns)\n" +
    ")* # Do this zero or more times.\n" +
    "(?: # Then match\n" +
    " (?: # using the same rules as above\n" +
    " [^\",\\n]*+ # an unquoted CSV field\n" +
    " | # or a quoted CSV field\n" +
    " \"(?:[^\"]*+|\"\")*\"\n" +
    " ) # End of alternation\n" +
    ") # End of non-capturing group\n" +
    "$ # End of string");
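If the check runs on many rows, it may be worth compiling the pattern once instead of calling String.matches each time. The sketch below is only an illustration: it uses the same pattern as above, written on one line without comments, and the class name CsvRowRegex and method name isValidRow are made up for this example.

```java
import java.util.regex.Pattern;

final class CsvRowRegex {
    // Same pattern as the verbose version above, without whitespace and comments.
    private static final Pattern ROW = Pattern.compile(
        "^(?:(?:[^\",\\n]*+|\"(?:[^\"]*+|\"\")*\"),)*"
        + "(?:[^\",\\n]*+|\"(?:[^\"]*+|\"\")*\")$");

    /** Returns true if the whole string is one well-formed CSV row. */
    static boolean isValidRow(String row) {
        return ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        // The row from the question.
        System.out.println(isValidRow("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1")); // true
        // A quoted field that is never closed.
        System.out.println(isValidRow("1,\"unterminated,2"));                      // false
    }
}
```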
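Note that you can't assume that every line in a CSV file is a complete row. A CSV row may contain newlines, as long as the column containing the newline is enclosed in quotes. This regex knows that, but it will fail if you feed it only part of a row. This is another reason why you really need a CSV parser to validate a CSV file; that's what parsers are for. If you control your input and know that a CSV field will never contain a newline, you might get away with the regex, but only then.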
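One way to see the problem with partial rows is to group physical lines into logical rows before validating them. The following is a rough sketch, not a replacement for a CSV parser: it assumes the doubled-quote convention from the question, so an odd number of quote characters means a quoted field is still open. The names CsvRecordJoiner, endsInsideQuotes and joinRecords are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvRecordJoiner {
    /** True if the text holds an odd number of quotes, i.e. a quoted field is still open. */
    static boolean endsInsideQuotes(CharSequence text) {
        boolean inside = false;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                inside = !inside; // a doubled quote toggles twice, so it cancels out
            }
        }
        return inside;
    }

    /** Joins physical lines so that each returned string is one logical CSV row. */
    static List<String> joinRecords(List<String> physicalLines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean pending = false; // true while a quoted field spans into the next line
        for (String line : physicalLines) {
            if (pending) {
                current.append('\n'); // restore the newline that split the row
            }
            current.append(line);
            pending = endsInsideQuotes(current);
            if (!pending) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (pending) {
            records.add(current.toString()); // unterminated quote; validation should reject it
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> rows = joinRecords(Arrays.asList("a,\"x", "y\",b", "c,d"));
        System.out.println(rows.size()); // 2
    }
}
```

Each joined row can then be handed to whatever validation is used for a single row.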
Answer 1 (score: 0)
I use the regexp from this blog post, which deals with the same problem you are trying to solve.
In short: ^(("(?:[^"]|"")*"|[^,]*)(,("(?:[^"]|"")*"|[^,]*))*)$
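A quick way to try this pattern from Java is sketched below; the class name BlogCsvRegex and method name matches are made up for the example. One thing the sketch shows is that the unquoted alternative [^,]* accepts any character except a comma, so a stray or unterminated quote inside a field is also accepted. Whether that leniency is acceptable depends on the input.

```java
import java.util.regex.Pattern;

final class BlogCsvRegex {
    // The pattern quoted in this answer, escaped for a Java string literal.
    private static final Pattern ROW = Pattern.compile(
        "^((\"(?:[^\"]|\"\")*\"|[^,]*)(,(\"(?:[^\"]|\"\")*\"|[^,]*))*)$");

    static boolean matches(String row) {
        return ROW.matcher(row).matches();
    }

    public static void main(String[] args) {
        // The row from the question.
        System.out.println(matches("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1")); // true
        // A stray quote in an unquoted field is accepted as well.
        System.out.println(matches("a\"b,c"));                                  // true
        // So is a quoted field that is never closed.
        System.out.println(matches("1,\"unterminated,2"));                      // true
    }
}
```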
Answer 2 (score: 0)
I haven't done Java in a while, so here is some pseudocode. You can turn it into a function that accepts a string representing a CSV row.
1. Split the row by the "," delimiter into an array of strings (the method might be called string.split()).
2. Iterate through the array (cells).
3.   If the current string (cell) contains a double quote:
4.     If it doesn't start with a quote, return false; else remove that quote.
5.     If it doesn't end with a quote, return false; else remove that quote.
6.     Iterate through the remaining characters of the string.
7.       If a quote is found, check whether the next character is also a quote; if it is not, return false; if it is, skip both.
8.     End character iteration.
9.   End if.
10. End array iteration.
11. Return true.
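The steps above could look roughly like this in Java. Note that this sketch is a variant, not a literal transcription: splitting on commas first would cut a quoted field that itself contains a comma, so the code walks the row character by character and applies the same quote checks along the way. The names CsvRowValidator and isValidRow are made up, and rejecting newlines in unquoted fields is an added assumption.

```java
final class CsvRowValidator {
    /** Returns true if the row is well-formed under the doubled-quote convention. */
    static boolean isValidRow(String row) {
        final int n = row.length();
        int i = 0;
        while (true) {
            if (i < n && row.charAt(i) == '"') {
                // Quoted field: skip the opening quote, then look for the closing one.
                i++;
                boolean closed = false;
                while (i < n && !closed) {
                    if (row.charAt(i) != '"') {
                        i++;                 // ordinary character (commas and newlines allowed)
                    } else if (i + 1 < n && row.charAt(i + 1) == '"') {
                        i += 2;              // doubled quote stands for one literal quote
                    } else {
                        i++;                 // single quote closes the field
                        closed = true;
                    }
                }
                if (!closed) {
                    return false;            // ran out of input inside a quoted field
                }
            } else {
                // Unquoted field: runs up to the next comma, no quotes or newlines inside.
                while (i < n && row.charAt(i) != ',') {
                    char c = row.charAt(i);
                    if (c == '"' || c == '\n') {
                        return false;
                    }
                    i++;
                }
            }
            if (i == n) {
                return true;                 // row ended right after a complete field
            }
            if (row.charAt(i) != ',') {
                return false;                // text directly after a closing quote
            }
            i++;                             // skip the comma and check the next field
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRow("1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1")); // true
        System.out.println(isValidRow("1,\"unterminated,2"));                      // false
    }
}
```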