Trying to parse a malformed CSV

Time: 2018-03-05 22:34:03

Tags: java regex

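So, I am parsing thousands of lines from a public government CSV file. The problem is that they include commas inside double-quoted values, which makes it very hard to parse consistently. The number of matches should be 251. I have tried negating double quotes, but that does not seem to work either.

Example: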

GS08P12VJP0107,0,0,,,,0,5300.00,5300.00,5300.00,2012-09-21,2012-09-21 00:00:00,2012-11-01 00:00:00,2012-11-01 00:00:00,,047,GENERAL SERVICES ADMINISTRATION (GSA),4740,PUBLIC BUILDINGS SERVICE,VJ000,"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE",047,GENERAL SERVICES ADMINISTRATION (GSA),4740,PUBLIC BUILDINGS SERVICE,VJ000,"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE",,,043570956,MIKE AUSTFJORD & SONS INC,,MIKE AUSTFJORD & SONS INC,043570956,UNITED STATES,,9469 138TH AVE NE,,,CAVALIER,ND,,582209505,ND00,7012654255,7012653110,USA,UNITED STATES,PEMBINA,PEMBINA,ND,NORTH DAKOTA,582719745,00,,B,PO,,,,,,NAN,J,FIRM FIXED PRICE,"EXCAVATE WETLANDS AS REMEDIATION AT US BORDER STATION, 10980 I-29, PEMBINA, NORTH DAKOTA.",,,,1,Z2AA,REPAIR OR ALTERATION OF OFFICE BUILDINGS,D,NOT A BUNDLED REQUIREMENT,,,238910,SITE PREPARATION CONTRACTORS,A,FAR 52.223-4 INCLUDED,A,U.S. OWNED BUSINESS,,,,,B,JUSTIFICATION - TIME,USA,,C,NOT A MANUFACTURED END PRODUCT,B,PLAN NOT REQUIRED,F,COMPETED UNDER SAP,SP1,SIMPLIFIED ACQUISITION,SBA,SMALL BUSINESS SET ASIDE - TOTAL,NONE,NO PREFERENCE USED,,NAN,,NAN,,,1,D,,f,N,NO,NO,,X,NOT APPLICABLE,N,,,N: NO,,X,NOT APPLICABLE,X,NOT APPLICABLE,Y,YES,X,NOT APPLICABLE,,,,,,,,NONE,NONE,,,,NAN,N,TRANSACTION DOES NOT USE GFE/GFP,,,X,NO,N,NO,N,NO - SERVICE WHERE PBA IS NOT USED.,,,,,N,NO,X,NOT APPLICABLE,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,SMALL BUSINESS,S,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,f,f,f,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,2012-09-21 00:00:00

Can anyone help? I am doing this with Java Pattern / Matcher.

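1 answer:

Answer 0: (score: 0)

There are a few different pattern groups to consider. Breaking your example down into the various cases, I came up with the following regex: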

(\"(.*?)\")|(.*?(,))|(.*)

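The first capturing group, (\"(.*?)\"), handles values in quotes.

The second, (.*?(,)), handles the other cases (no quotes).

The last, (.*), is for the final part of the CSV line, which has no trailing comma.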
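To see how the three alternatives interact, here is a minimal sketch that collects the raw text of each match. The class name, method name, and the shortened input line are illustrative assumptions, not part of the original post; the input is only modeled on the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class AnswerRegexDemo {
    // The answer's regex, written as a Java string literal.
    static final Pattern ANSWER = Pattern.compile("(\"(.*?)\")|(.*?(,))|(.*)");

    // Collects the raw text of every successive match, unmodified.
    static List<String> rawMatches(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = ANSWER.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened, made-up line with one quoted value that contains a comma.
        for (String s : rawMatches("VJ000,\"GSA/PBS, ND\",047")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

On this three-field input the loop yields five matches: `VJ000,` (with its comma), the quoted value including its quotes, a lone `,` after the closing quote, `047`, and an empty match at the end of the line. The match count therefore will not equal the field count without extra filtering.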

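Edit

This post got more comments than I expected.

Of course the solution above leaves room for improvement: for example, it does not account for doubled double quotes inside a value, and it includes the trailing comma in the matched values. The asker mentioned they were trying to solve this with Pattern/Matcher, so with a regex suited to their use case, something like this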

Pattern p = Pattern.compile(someRegex);
String line = ... // get line from somewhere
Matcher m = p.matcher(line);

while (m.find()) {
    // do stuff
}

would probably suffice.
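As one possible way to close the two gaps noted above (quotes inside values and trailing commas), here is a sketch of a stricter pattern. It assumes every field is either fully quoted or contains no quotes at all, that a literal quote inside a quoted field is written as `""`, and that a line contains no embedded line breaks. The class name, pattern, and test line are my own illustration rather than anything from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the answer's regex.
class CsvLineSplitter {
    // Group 1: body of a quoted field ("" stands for one literal quote).
    // Group 2: an unquoted field.
    // Group 3: the delimiter that follows, or empty at the end of the line.
    private static final Pattern FIELD = Pattern.compile(
            "(?:\"((?:[^\"]|\"\")*)\"|([^\",]*))(,|$)");

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        int pos = 0;
        while (true) {
            // Anchor each attempt at the current position so nothing is skipped.
            m.region(pos, line.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Malformed field at index " + pos);
            }
            String quoted = m.group(1);
            fields.add(quoted != null ? quoted.replace("\"\"", "\"") : m.group(2));
            pos = m.end();
            if (m.group(3).isEmpty()) {
                return fields; // no comma consumed, so this was the last field
            }
        }
    }

    public static void main(String[] args) {
        String line = "VJ000,\"GSA/PBS/MTN PLAINS SVS CTR, NORTH DAKOTA FIELD OFFICE\",047,,ND";
        List<String> fields = split(line);
        System.out.println(fields.size() + " fields");
        for (String f : fields) {
            System.out.println("[" + f + "]");
        }
    }
}
```

Using `lookingAt()` on a moving region instead of `find()` means unexpected input raises an error rather than being silently skipped, which makes a wrong field count easier to diagnose. For real files a dedicated CSV parser is likely the safer choice.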

One user suggested using Apache Commons CSV, which can be found at https://mvnrepository.com/artifact/org.apache.commons/commons-csv/1.5 (the latest version at the time of writing).

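import java.io.FileReader;
import java.util.Iterator;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// source: the CSV file to read (a path or File)
for (CSVRecord record : CSVFormat.DEFAULT.parse(new FileReader(source))) {
    // Each record is one parsed row; iterate over its column values
    Iterator<String> it = record.iterator();
    while (it.hasNext()) {
        String colVal = it.next();
        // do stuff
    }
}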

For practical usage examples, see the documentation at https://commons.apache.org/proper/commons-csv/user-guide.html.