How to pass a CSV-mapped bean class to a Dataset

Asked: 2017-08-30 11:24:38

Tags: java apache-spark

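I wrote code that reads a CSV file and maps all of its columns to a bean class. Now I am trying to load those values into a Dataset, and I get the following error: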

7/08/30 16:33:58 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.IllegalArgumentException: object is not an instance of declaring class
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

If I set the values manually, it works fine. Here is the code:

public void run(String t, String u) throws FileNotFoundException {

    JavaRDD<String> pairRDD =  sparkContext.textFile("C:/temp/L1_result.csv");
    JavaPairRDD<String,String> rowJavaRDD = pairRDD.mapToPair(new PairFunction<String, String, String>() {

        public Tuple2<String,String> call(String rec) throws FileNotFoundException {
            String[] tokens = rec.split(";");
            String[] vals = new String[tokens.length];
            for(int i= 0; i < tokens.length; i++){
                vals[i] =tokens[i];
            }

            return new Tuple2<String, String>(tokens[0], tokens[1]);
        }
    });


    ColumnPositionMappingStrategy cpm = new ColumnPositionMappingStrategy();
    cpm.setType(funds.class);
    String[] csvcolumns = new String[]{"portfolio_id", "portfolio_code"};
    cpm.setColumnMapping(csvcolumns);

    CSVReader csvReader = new CSVReader(new FileReader("C:/temp/L1_result.csv"));

    CsvToBean csvtobean = new CsvToBean();
    List csvDataList = csvtobean.parse(cpm, csvReader);

    for (Object dataobject : csvDataList) {
        funds fund = (funds) dataobject;
        System.out.println("Portfolio:"+fund.getPortfolio_id()+ " code:"+fund.getPortfolio_code());
    }

    /*  funds b0 = new funds();
    b0.setK("k0");
    b0.setSomething("sth0");
    funds b1 = new funds();
    b1.setK("k1");
    b1.setSomething("sth1");
    List<funds> data = new ArrayList<funds>();
    data.add(b0);
    data.add(b1);*/

    System.out.println("Portfolio:" + rowJavaRDD.values());


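    // Building the list by hand (the commented-out block above) works fine:
    //  Dataset<Row> fundDf = SQLContext.createDataFrame(data, funds.class);
    // This call fails: rowJavaRDD.values() holds Strings, not funds objects.
    Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);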
    fundDf.printSchema();
    fundDf.write().option("mergeschema", true).parquet("C:/test");
}

The following line is the one that fails, when rowJavaRDD.values() is passed in:

Dataset<Row> fundDf = SQLContext.createDataFrame(rowJavaRDD.values(), funds.class);

How can this be resolved? I think I should be passing the column-mapped values here, but I don't know how to do that. Any ideas would help.

1 Answer:

Answer 0: (score: 0)

Dataset<Row> fundDf = SQLContext.createDataFrame(csvDataList, funds.class);

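Passing the list works! csvDataList contains funds objects, whereas rowJavaRDD.values() is a JavaRDD<String>, so the elements did not match the funds.class argument given to createDataFrame.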
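
A note on why the original call probably failed: the stack trace goes through sun.reflect.NativeMethodAccessorImpl, which suggests the bean getters were being invoked by reflection on objects that were not funds instances. The sketch below reproduces that situation in plain Java, without Spark. The Fund class, readId and parseLine are hypothetical names made up for this illustration (Fund stands in for the funds bean from the question), and the ";" separator is taken from the split call in the question.

```java
import java.io.Serializable;
import java.lang.reflect.Method;

class BeanMismatchDemo {

    // Hypothetical stand-in for the question's `funds` bean (an assumption,
    // the real class is not shown in the question).
    static class Fund implements Serializable {
        private String portfolio_id;
        private String portfolio_code;

        public String getPortfolio_id() { return portfolio_id; }
        public void setPortfolio_id(String v) { this.portfolio_id = v; }
        public String getPortfolio_code() { return portfolio_code; }
        public void setPortfolio_code(String v) { this.portfolio_code = v; }
    }

    // Invokes the bean getter reflectively on an arbitrary object, roughly
    // what a bean-based schema mapper has to do for every row.
    static String readId(Object row) {
        try {
            Method getter = Fund.class.getMethod("getPortfolio_id");
            // Throws IllegalArgumentException when `row` is not a Fund.
            return (String) getter.invoke(row);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Converts one "id;code" text line into a bean.
    static Fund parseLine(String line) {
        String[] tokens = line.split(";");
        Fund fund = new Fund();
        fund.setPortfolio_id(tokens[0]);
        fund.setPortfolio_code(tokens[1]);
        return fund;
    }

    public static void main(String[] args) {
        try {
            // A raw String row, like the elements of rowJavaRDD.values().
            readId("P1;CODE1");
        } catch (IllegalArgumentException e) {
            System.out.println("String row rejected: " + e.getMessage());
        }
        // After converting the line to a bean, the same getter call succeeds.
        System.out.println("Bean row id: " + readId(parseLine("P1;CODE1")));
    }
}
```

If the data has to stay in an RDD rather than a list, the same conversion could in principle be applied with a map over the JavaRDD<String> of lines, so that createDataFrame receives a JavaRDD of beans. That would also require the bean class to be Serializable, and it has not been tested against the setup in the question.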
