Parsing CSV with double quotes in some cases

Time: 2011-10-17 22:44:45

Tags: java parsing csv

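I have a CSV in the following format:

a1,a2,a3,"a4,a5",a6

Only fields that contain a comma are quoted.

Using Java, how can I easily parse this? I am trying to avoid open-source CSV parsers because of company policy. Thanks.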

4 Answers:

Answer 0: (score: 22)

You can use Matcher.find with the following regular expression:

\s*("[^"]*"|[^,]*)\s*

Here is a more complete example:

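import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "a1, a2, a3, \"a4,a5\", a6";
// Group 1 is either a whole quoted field (quotes included) or a run of non-comma characters.
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // The pattern can match the empty string, so find() also reports an
    // empty match at each comma and at the end of the input.
    System.out.println(matcher.group(1));
}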

See it working online: ideone
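One thing to watch with this pattern: it can match the empty string, so Matcher.find also returns an empty match at every comma, and quoted fields come back with their quotes still attached. The following is a minimal sketch of one possible variant, not the answerer's code. It anchors each match at the start of the line or directly after a comma and captures quoted and unquoted content in separate groups. The class name RegexCsvSketch and the method parse are made up for illustration, and escaped quotes ("") inside a field are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCsvSketch {
    // A field starts at the beginning of the line or right after a comma.
    // Group 1 captures the inside of a quoted field, group 2 an unquoted field.
    private static final Pattern FIELD =
            Pattern.compile("(?:^|(?<=,))\\s*(?:\"([^\"]*)\"|([^,]*))");

    // Hypothetical helper: splits one CSV line into its fields.
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Quoted fields keep their inner text verbatim; unquoted ones are trimmed.
            fields.add(m.group(1) != null ? m.group(1) : m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // prints [a1, a2, a3, a4,a5, a6]
        System.out.println(parse("a1, a2, a3, \"a4,a5\", a6"));
    }
}
```

Empty fields are preserved, so "a,,b" yields three fields with an empty string in the middle. Anything between a closing quote and the next comma is silently ignored, which may or may not be what you want.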

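Answer 1: (score: 3)

I had the same problem (but in Python). One way I found to solve it without regular expressions: when you get the line, check whether it contains any quotes. If it does, split the string on the quote character, then split the even-indexed elements of the resulting array on commas. The odd-indexed elements are the complete quoted values.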

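I'm not a Java coder, so treat this as a rough sketch. It assumes well-formed input: no escaped quotes, and no spaces between a quote and the neighbouring comma.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

static List<String> parseLine(String row) {
    List<String> line = new ArrayList<>();
    // A row without quotes yields a single piece, which is split on commas as usual.
    String[] vals = row.split("\"", -1);  // limit -1 keeps trailing empty pieces
    for (int i = 0; i < vals.length; i++) {
        if (i % 2 == 1) {                 // odd index: text that was inside quotes
            line.add(vals[i]);
            continue;
        }
        // Even index: unquoted text. Drop the comma on each side that touches
        // a quoted value before splitting the rest on commas.
        int start = (i > 0) ? 1 : 0;
        int end = vals[i].length() - ((i < vals.length - 1) ? 1 : 0);
        if (start > end) continue;        // no field here, e.g. just the comma between two quoted values
        line.addAll(Arrays.asList(vals[i].substring(start, end).split(",", -1)));
    }
    return line;
}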

Or, use a regular expression.

Answer 2: (score: 3)

Here is some code for you. I hope using code from here doesn't count as open source, which it is.

package bestsss.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitCSVLine {
    public static String[] splitCSV(BufferedReader reader) throws IOException{
        return splitCSV(reader, null, ',', '"');
    }

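    /**
     * Reads one CSV record, which may span several lines if a quoted field contains line breaks.
     *
     * @param reader a line-oriented reader positioned at the start of a record
     * @param expectedColumns optional int[1] (may be null): on entry, a capacity hint for the
     *        column count; on return, set to the number of columns actually read
     * @param separator the field separator, ',' or an alternative such as ';'
     * @param quote the quote character, '"' or an alternative
     * @return String[] containing the fields of the record
     * @throws IOException if reading from the reader fails
     */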
    public static String[] splitCSV(BufferedReader reader, int[] expectedColumns, char separator, char quote) throws IOException{       
        final List<String> tokens = new ArrayList<String>(expectedColumns==null?8:expectedColumns[0]);
        final StringBuilder sb = new StringBuilder(24);

        for(boolean quoted=false;;sb.append('\n')) {//lazy, we do not preserve the original new line, but meh
            final String line = reader.readLine();
            if (line==null)
                break;
            for (int i = 0, len= line.length(); i < len; i++) { 
                final char c = line.charAt(i);
                if (c == quote) {
                    if( quoted   && i<len-1 && line.charAt(i+1) == quote ){//2xdouble quote in quoted 
                        sb.append(c);
                        i++;//skip it
                    }else{
                        if (quoted){
                            //next symbol must be either separator or eol according to RFC 4180
                            if (i==len-1 || line.charAt(i+1) == separator){
                                quoted = false;
                                continue;
                            }
                        } else{//not quoted
                            if (sb.length()==0){//at the very start
                                quoted=true;
                                continue;
                            }
                        }
                        //if fall here, bogus, just add the quote and move on; or throw exception if you like to
                        /*
                        5.  Each field may or may not be enclosed in double quotes (however
                           some programs, such as Microsoft Excel, do not use double quotes
                           at all).  If fields are not enclosed with double quotes, then
                           double quotes may not appear inside the fields.
                      */ 
                        sb.append(c);                   
                    }
                } else if (c == separator && !quoted) {
                    tokens.add(sb.toString());
                    sb.setLength(0); 
                } else {
                    sb.append(c);
                }
            }
            if (!quoted)
                break;      
        }
        tokens.add(sb.toString());//add last
        if (expectedColumns !=null)
            expectedColumns[0] = tokens.size();
        return tokens.toArray(new String[tokens.size()]);
    }
    public static void main(String[] args) throws Throwable{
        java.io.StringReader r = new java.io.StringReader("222,\"\"\"zzzz\", abc\"\" ,   111   ,\"1\n2\n3\n\"");
        System.out.println(java.util.Arrays.toString(splitCSV(new BufferedReader(r))));
    }
}

Answer 3: (score: 1)

The following code seems to work well and can handle quotes inside quotes.

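import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A quoted field at the start of the input: optional whitespace, an opening quote,
// the content (where "" stands for a literal quote), a closing quote, then a comma.
final static Pattern quote = Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");

public static List<String> parseCsv(String line) throws Exception
{       
    List<String> list = new ArrayList<String>();
    line += ",";  // sentinel so that every field, including the last, ends with a comma

    for (int x = 0; x < line.length(); x++)
    {
        String s = line.substring(x);  // the unparsed remainder
        if (s.trim().startsWith("\""))
        {
            Matcher m = quote.matcher(s);
            if (!m.find())
                throw new Exception("CSV is malformed");
            list.add(m.group(1).replace("\"\"", "\""));  // unescape doubled quotes
            x += m.end() - 1;  // land on the trailing comma; the loop's x++ steps past it
        }
        else
        {
            int y = s.indexOf(",");
            if (y == -1)
                throw new Exception("CSV is malformed");
            list.add(s.substring(0, y));  // unquoted field, kept as is (not trimmed)
            x += y;  // land on the comma; the loop's x++ steps past it
        }
    }
    return list;
}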