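I have searched Google for this problem as far as page 3 of the results, but there does not seem to be a proper solution.
The following string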
"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"
should be split on commas in Java. The quotes can be either double or single quotes. I tried the following regex
,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)
but it fails because of the escaped quote in 'marc o\'polo'...
Can anyone help me?
Code for trying it out:
// The backslash must be doubled in Java source so that it actually ends up in the string
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
    System.out.println(split);
}
Answer 0 (score: 4)
You can do it like this:
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while (m.find()) {
    result.add(m.group());
}
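To try this pattern on the string from the question, a self-contained sketch could look like the following. The class name `RegexSplitDemo` and the method `tokenize` are made up for illustration and are not part of the answer. Note that each match appears to keep its surrounding quotes and backslashes, so stripping them would be a separate step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, only here to make the snippet runnable.
public class RegexSplitDemo {
    // Pattern copied unchanged from the answer: a field is a run of
    // unquoted characters, a quoted section with backslash escapes,
    // or an empty slot between two commas.
    private static final Pattern FIELD = Pattern.compile(
        "(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+",
        Pattern.DOTALL);

    // Collects every match of FIELD; quotes and escapes are left in place.
    static List<String> tokenize(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The o-umlaut is written as a Unicode escape to avoid source-encoding issues.
        String checkString = "zhg,wim\u00f6,'astor wohnideen','multistore 2002',"
            + "yonza,'asdf, saflk','marc o\\'polo'";
        for (String field : tokenize(checkString)) {
            System.out.println(field);
        }
    }
}
```

The comma inside 'asdf, saflk' stays within a single field because the quoted alternative consumes it before the matcher ever looks for a separator.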
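Answer 1 (score: 1)
Splitting CSV with a regex is not the right solution... which is probably why you are having a hard time finding one with split/csv/regex search terms.
Using a dedicated library with a state machine is usually the best solution. There are many of them.
I can say that regexes and CSV get very, very complicated relatively quickly (as you have found out), and that a 'raw' parser is better for performance reasons alone.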
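Answer 2 (score: 0)
If you are parsing CSV (or something similar), using one of the established frameworks is usually a good idea, because they cover most corner cases and have been tested by a wider audience through use in different projects.
However, if a library is not an option, you can go with something like this: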
import java.util.ArrayList;
import java.util.List;

public class Curios {
    public static void main(String[] args) {
        String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
        List<String> result = splitValues(checkString);
        System.out.println(result);
        System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
    }

    public static List<String> splitValues(String checkString) {
        List<String> result = new ArrayList<String>();
        // Used for reporting errors and detecting quotes
        int startOfValue = 0;
        // Used to mark the next character as being escaped
        boolean charEscaped = false;
        // Is the current value quoted?
        boolean quoted = false;
        // Quote-character in use (only valid when quoted == true)
        char quote = '\0';
        // All characters read from current value
        final StringBuilder currentValue = new StringBuilder();
        for (int i = 0; i < checkString.length(); i++) {
            final char charAt = checkString.charAt(i);
            if (i == startOfValue && !quoted) {
                // We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
                if (charAt == '\'' || charAt == '"') {
                    // This will be a quoted String
                    quote = charAt;
                    quoted = true;
                    startOfValue++;
                    continue;
                }
            }
            if (!charEscaped) {
                if (charAt == '\\') {
                    charEscaped = true;
                } else if (quoted && charAt == quote) {
                    if (i + 1 == checkString.length()) {
                        // So we don't throw an exception
                        quoted = false;
                        // Last value will be added to result outside loop
                        break;
                    } else if (checkString.charAt(i + 1) == ',') {
                        // Ensure we don't parse , again
                        i++;
                        // Add the value to the result
                        result.add(currentValue.toString());
                        // Prepare for next value
                        currentValue.setLength(0);
                        startOfValue = i + 1;
                        quoted = false;
                    } else {
                        throw new IllegalStateException(String.format(
                                "Value was quoted with %s but prematurely terminated at position %d " +
                                "maybe a \\ is missing before this %s or a , after? " +
                                "Value up to this point: \"%s\"",
                                quote, i, quote, checkString.substring(startOfValue, i + 1)));
                    }
                } else if (!quoted && charAt == ',') {
                    // Add the value to the result
                    result.add(currentValue.toString());
                    // Prepare for next value
                    currentValue.setLength(0);
                    startOfValue = i + 1;
                } else {
                    // a boring character
                    currentValue.append(charAt);
                }
            } else {
                // So we don't forget to reset for next char...
                charEscaped = false;
                // Here we can do interpolations
                switch (charAt) {
                    case 'n':
                        currentValue.append('\n');
                        break;
                    case 'r':
                        currentValue.append('\r');
                        break;
                    case 't':
                        currentValue.append('\t');
                        break;
                    default:
                        currentValue.append(charAt);
                }
            }
        }
        if (charEscaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        } else if (quoted) {
            throw new IllegalStateException("Last value was quoted with " + quote + " but there is no terminating quote.");
        }
        // Add the last value to the result
        result.add(currentValue.toString());
        return result;
    }
}
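Why not simply use a regex?
Regexes do not understand nesting very well. While Casimir's regex does a good job, the difference between quoted and unquoted values is easier to model in some form of state machine. You have seen how hard it is to make sure you do not accidentally match an escaped or quoted ,. Also, when you are already evaluating every character, it is easy to interpret escape sequences like \n.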
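To make the point about quoted commas concrete, here is a minimal sketch of what a quote-unaware split does. The class name `NaiveSplitDemo` and the method `naiveSplit` are hypothetical and only serve the illustration.

```java
// Hypothetical demo class showing why a plain split on "," is not enough.
public class NaiveSplitDemo {
    // Splits on every comma, with no knowledge of quotes or escapes.
    static String[] naiveSplit(String input) {
        return input.split(",");
    }

    public static void main(String[] args) {
        // Two logical values: yonza and 'asdf, saflk'
        String input = "yonza,'asdf, saflk'";
        // The quoted comma is treated as a separator, so three pieces
        // come back instead of two.
        for (String piece : naiveSplit(input)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

A parser that tracks whether it is currently inside a quoted value can tell the two commas apart, which is the state a plain split does not have.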
What should you watch out for?
My function interprets the escape sequences \n, \r, \t and \\ like most C-style language interpreters do, while reading \x as x (this can easily be changed).
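The escape handling described here can be looked at in isolation. The following is a sketch under the assumption that the same rules as in splitValues apply; the class name `EscapeDemo` and the method `unescape` are hypothetical and do not appear in the code above.

```java
// Hypothetical helper isolating the escape rules used by splitValues.
public class EscapeDemo {
    // Turns \n, \r and \t into control characters; any other escaped
    // character (including \\ and \') is copied without its backslash.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (escaped) {
                escaped = false;
                switch (c) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    // Unknown escapes fall through to "read \x as x".
                    default: out.append(c);
                }
            } else if (c == '\\') {
                escaped = true;
            } else {
                out.append(c);
            }
        }
        if (escaped) {
            throw new IllegalStateException("Input ended with a stray \\");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Java literal for the text: marc o\'polo
        System.out.println(unescape("marc o\\'polo"));
    }
}
```

If unknown escapes should be rejected instead of passed through, throwing from the default branch would be one way to change that behaviour.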