Question

How do I convert a list of DTOs into the input Dataset format that Spark ML expects?

I have this DTO:

public class MachineLearningDTO implements Serializable {
    private double label;
    private double[] features;

    public MachineLearningDTO() {
    }

    public MachineLearningDTO(double label, double[] features) {
        this.label = label;
        this.features = features;
    }

    public double getLabel() {
        return label;
    }

    public void setLabel(double label) {
        this.label = label;
    }

    public double[] getFeatures() {
        return features;
    }

    public void setFeatures(double[] features) {
        this.features = features;
    }
}

And this code:

Dataset<MachineLearningDTO> mlInputDataSet = spark.createDataset(mlInputData, Encoders.bean(MachineLearningDTO.class));
LogisticRegression logisticRegression = new LogisticRegression();
LogisticRegressionModel model = logisticRegression.fit(MLUtils.convertMatrixColumnsToML(mlInputDataSet));

When I run it, I get:

java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,false).

If I change the features field to org.apache.spark.ml.linalg.VectorUDT, using code like this:

VectorUDT vectorUDT = new VectorUDT();
vectorUDT.serialize(Vectors.dense(......));

then I get:

java.lang.UnsupportedOperationException: Cannot infer type for class org.apache.spark.ml.linalg.VectorUDT because it is not bean-compliant
    at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:437)

Answer 1

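I figured it out. In case anyone else gets stuck on this: the bean encoder maps double[] to ArrayType(DoubleType), so I build the rows myself with an explicit schema whose features column is a VectorUDT. I wrote a simple converter and it works: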

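import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.*;

// `spark` is the SparkSession held by the enclosing class.
private Dataset<Row> convertToMlInputFormat(List<MachineLearningDTO> data) {
    // getLabel() already returns a double, so it is passed through unchanged;
    // the double[] features are wrapped in a dense ML vector.
    List<Row> rowData = data.stream()
            .map(dto -> RowFactory.create(dto.getLabel(), Vectors.dense(dto.getFeatures())))
            .collect(Collectors.toList());
    StructType schema = new StructType(new StructField[]{
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())
    });

    return spark.createDataFrame(rowData, schema);
}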
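
For completeness, here is a minimal end-to-end sketch of the same idea, ending with a LogisticRegression fit. It is only an illustration under a few assumptions: the class name MlInputExample, the nested Sample class (a hypothetical stand-in for MachineLearningDTO), the local[*] master and the toy data are all made up for the example. It also uses SQLDataTypes.VectorType(), which as far as I know is the public accessor for the same VectorUDT type, instead of calling new VectorUDT() directly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.SQLDataTypes;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MlInputExample {

    // Hypothetical stand-in for MachineLearningDTO: a label plus raw feature values.
    public static class Sample {
        final double label;
        final double[] features;

        public Sample(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Builds a DataFrame with a double "label" column and a vector "features" column.
    public static Dataset<Row> toMlInput(SparkSession spark, List<Sample> data) {
        List<Row> rows = data.stream()
                .map(s -> RowFactory.create(s.label, Vectors.dense(s.features)))
                .collect(Collectors.toList());
        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                // SQLDataTypes.VectorType() is the vector column type used by spark.ml.
                new StructField("features", SQLDataTypes.VectorType(), false, Metadata.empty())
        });
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        // Local session for the sketch only; a real job would configure this elsewhere.
        SparkSession spark = SparkSession.builder()
                .appName("MlInputExample")
                .master("local[*]")
                .getOrCreate();

        // Toy data, invented for illustration.
        List<Sample> data = Arrays.asList(
                new Sample(0.0, new double[]{0.0, 1.1}),
                new Sample(1.0, new double[]{2.0, 1.0}),
                new Sample(0.0, new double[]{0.1, 1.3}),
                new Sample(1.0, new double[]{1.9, 0.8}));

        Dataset<Row> training = toMlInput(spark, data);
        training.printSchema();

        // The estimator reads the "label" and "features" columns by default.
        LogisticRegressionModel model = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.1)
                .fit(training);
        System.out.println(model.coefficients());

        spark.stop();
    }
}
```

The explicit schema is what avoids both errors from the question: nothing is inferred from a bean, so the features column is declared as a vector from the start.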
