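What regular expression would I use to find sets of 2 unescaped double quotes contained in columns that are enclosed in double quotes in a CSV file?
Should not match: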
"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"
Should match:
"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Answer 0: (score: 3)
Try this:
(?m)""(?![ \t]*(,|$))
Explanation:
(?m) // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
"" // match two successive double quotes
(?! // start negative look ahead
[ \t]* // zero or more spaces or tabs
( // open group 1
, // match a comma
| // OR
$ // the end of the line or string
) // close group 1
) // stop negative look ahead
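So, in plain English: "match two successive double quotes only if they are NOT followed by a comma or the end of the line, optionally with spaces and tabs in between".
(i) in addition to being the normal start-of-string and end-of-string metacharacters.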
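To see how the pattern behaves on the question's samples, here is a minimal Ruby sketch. It is only an illustration, not part of the original answer: the constant and helper names are made up, and the `(?m)` prefix is dropped because in Ruby `^` and `$` already match at line boundaries (there, `(?m)` would instead make `.` match newlines).

```ruby
# Sample lines copied from the question.
NO_MATCH = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"'].freeze
MATCH    = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"'].freeze

# Same pattern as above without the (?m) prefix: in Ruby, ^ and $ already
# match at line boundaries, and (?m) would instead make "." match newlines.
DOUBLED_QUOTE = /""(?![ \t]*(?:,|$))/

# Hypothetical helper name, introduced only for this sketch.
def doubled_quote?(line)
  DOUBLED_QUOTE.match?(line)
end

(NO_MATCH + MATCH).each do |line|
  puts "#{line}: #{doubled_quote?(line)}"
end
```

The first four lines should print `false` and the last three `true`, which is the split the question asks for.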
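Answer 1: (score: 2)
Because of the complexity of the problem, the solution depends on the engine you are using: solving it requires lookbehind and lookahead, and engines differ in how they support them.
My answer uses the Ruby engine. The check itself is a single regex, but I include the complete code here to explain it better.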
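Note that, because of the Ruby regex engine (or the limits of my knowledge of it), I could not make the lookahead/lookbehind optional. As a workaround, the code first removes any spaces before and after the separating commas.
Here is my code: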
org_texts = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
]

org_texts.each do |org_text|
  # Preprocessing: eliminate spaces before and after a separating comma.
  # Needed only if the input may have spaces around a valid comma.
  org_text = org_text.gsub(/" *, *"/, '","')

  # Replace every valid character (a non-quote or a valid quote) with '-'.
  res_text = org_text.gsub(/[^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)"/, '-')
  # Equivalent, grouped form:
  # res_text = org_text.gsub(/[^"]|(^|(?<=,)|(?<=\\))"|"($|(?=,))/, '-')
  #   [^"]      ===> a non-quote
  #   ^"        ===> the opening quote of the line
  #   "$        ===> the closing quote of the line
  #   (?<=,)"   ===> a quote just after a comma
  #   "(?=,)    ===> a quote just before a comma
  #   (?<=\\)"  ===> a backslash-escaped quote

  # Show where the invalid (unescaped) quotes are.
  # puts is used because print does not append a newline.
  puts org_text
  puts res_text.gsub('"', '^')

  # The actual check: a line is clean only if it consists of valid characters
  # from start to end (^...$). Use just this if you do not need the positions.
  is_match = (org_text !~ /^([^"]|^"|"$|(?<=,)"|"(?=,)|(?<=\\)")*$/).to_s
  puts org_text + ": " + is_match
end
When executed, the code prints:
"asdf","asdf"
-------------
"asdf","asdf": false
"","asdf"
---------
"","asdf": false
"asdf",""
---------
"asdf","": false
"adsf","","asdf"
----------------
"adsf","","asdf": false
"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true
"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true
"asdf",""""
--------^^-
"asdf","""": true
I hope this gives you some ideas that you can carry over to other engines and languages.
Answer 2: (score: 0)
".*"(\n|(".*",)*)
Should work, I guess...
Answer 3: (score: 0)
For a single-line match:
^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
Or multi-line:
(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
Edit/Note: depending on the regex engine used, you could use lookbehinds and other features to make the regex leaner. But this should work in most regex engines.
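As a rough check, the single-line pattern can be run over the question's samples. The Ruby sketch below is only an illustration: the constant and helper names are hypothetical, and it exercises the single-line form one line at a time rather than the multi-line variant.

```ruby
# Sample lines copied from the question.
CLEAN_LINES = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"'].freeze
DIRTY_LINES = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"'].freeze

# The single-line pattern from this answer, unchanged.
SINGLE_LINE = /^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"/

# Hypothetical helper name, used only in this sketch.
def line_has_doubled_quote?(line)
  SINGLE_LINE.match?(line)
end

(CLEAN_LINES + DIRTY_LINES).each do |line|
  puts "#{line}: #{line_has_doubled_quote?(line)}"
end
```

The pattern works by skipping any number of well-formed `"..."` columns and then requiring a column that contains `""` between its opening and closing quotes, so it needs no lookaround support.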
Answer 4: (score: 0)
Try this regex:
"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
This will match any quoted string that contains at least one pair of unescaped double quotes.
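Because this pattern matches the whole quoted column rather than just the pair of quotes, it can also be used to pull out the offending field. A minimal Ruby sketch under that assumption follows; the constant and helper names are hypothetical and not part of the answer.

```ruby
# The pattern from this answer, unchanged.
QUOTED_WITH_PAIR = /"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"/

# Hypothetical helper: returns the first quoted column containing a doubled
# quote, or nil when the line has none.
def offending_field(line)
  line[QUOTED_WITH_PAIR]
end

# Sample lines copied from the question.
[
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
].each do |line|
  puts "#{line} => #{offending_field(line).inspect}"
end
```

For the first four samples the helper should return `nil`; for the last three it should return the column that holds the doubled quotes.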