How do I set an encoder for Row / LabeledPoint data in Spark?

Date: 2017-05-29 09:43:18

Tags: apache-spark apache-spark-sql apache-spark-dataset apache-spark-encoders

How do I set an encoder for LabeledPoint data, which is a combination of a Double label and a Vector of features? Which encoder do I need so that the DataFrame can be created?

public static Dataset<LabeledPoint> convertRDDStringToLabeledPoint(Dataset<String> data,String delimiter) {
    Dataset<LabeledPoint> labeledPointData = data.map(
            (data1)->{
                String[] splitter = data1.split(delimiter);
                double[] arr = new double[splitter.length - 1];
                IntStream.range(0,arr.length).forEach(i->arr[i]=Double.parseDouble(splitter[i+1]));
                return new LabeledPoint(Double.parseDouble(splitter[0]), Vectors.dense(arr));
            },Encoders.???);
    return labeledPointData;
}

1 Answer:

Answer 0 (score: 1)

LabeledPoint is a case class in Scala, so I think it would be Encoders.product[LabeledPoint].

(I don't know how to write that in Java.)
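A note beyond the original answer: `Encoders.product` is only reachable from Scala, since it relies on Scala case-class reflection. From Java, one commonly used workaround (an assumption here, not confirmed by the answerer) is a Kryo-based encoder, `Encoders.kryo(LabeledPoint.class)`, passed as the second argument to `map`; the trade-off is that rows are then stored as opaque binary blobs rather than columnar fields. Independent of the encoder choice, the parsing logic inside the question's lambda can be checked in plain Java, with Spark and LabeledPoint stripped out so it runs standalone:

```java
import java.util.Arrays;

// Plain-Java sketch of the map function's parsing step from the question:
// the first delimited field is the label, the remaining fields are features.
// The helper names below are hypothetical, chosen for this illustration.
public class LabeledPointParseSketch {

    // Mirrors the feature-array construction in convertRDDStringToLabeledPoint.
    static double[] parseFeatures(String line, String delimiter) {
        String[] splitter = line.split(delimiter);
        double[] arr = new double[splitter.length - 1];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = Double.parseDouble(splitter[i + 1]);
        }
        return arr;
    }

    // Mirrors the label extraction (field 0).
    static double parseLabel(String line, String delimiter) {
        return Double.parseDouble(line.split(delimiter)[0]);
    }

    public static void main(String[] args) {
        String line = "1.0,0.5,2.5,3.0";
        System.out.println(parseLabel(line, ","));                    // prints 1.0
        System.out.println(Arrays.toString(parseFeatures(line, ","))); // prints [0.5, 2.5, 3.0]
    }
}
```

In the Spark version, the two parsed values would be combined as `new LabeledPoint(label, Vectors.dense(features))`, exactly as in the question's lambda.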