Question

I have a CSV file with two columns:

id, features

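The id column is a string, and the features column is a comma-separated list of feature values for a machine learning algorithm, e.g. "[1,4,5]". I basically just need to call Vectors.parse() on the value to get a vector, but I don't want to convert to an RDD first.

I want to get this into a Spark DataFrame in which the features column is an org.apache.spark.mllib.linalg.Vector.

I am reading the file into a DataFrame with the databricks csv api, and I am trying to convert the features column to a Vector.

Does anyone know how to do this in Java?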

Answer 1

I found a way to do this with a UDF. Are there any other ways?

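  // Assumes sqlc is an existing SQLContext and args[0] is the path to the CSV file.
  // Needs imports from java.util, org.apache.spark.sql (DataFrame, functions),
  // org.apache.spark.sql.api.java (UDF1), org.apache.spark.sql.types, and
  // org.apache.spark.mllib.linalg (Vector, Vectors, VectorUDT).

  // The first line of the CSV file is a header row.
  HashMap<String, String> options = new HashMap<String, String>();
  options.put("header", "true");
  String input = args[0];

  // Register a UDF that parses a string such as "[1,4,5]" into an MLlib Vector.
  sqlc.udf().register("toVector", new UDF1<String, Vector>() {
     @Override
     public Vector call(String t1) throws Exception {
        return Vectors.parse(t1);
     }
  }, new VectorUDT());

  // Read both columns as strings; features is converted afterwards.
  StructField[] fields = {
     new StructField("id", DataTypes.StringType, false, Metadata.empty()),
     new StructField("features", DataTypes.StringType, false, Metadata.empty())
  };
  StructType schema = new StructType(fields);

  DataFrame df = sqlc.read().format("com.databricks.spark.csv").schema(schema).options(options).load(input);

  // Replace the string column with the parsed Vector column.
  df = df.withColumn("features", functions.callUDF("toVector", df.col("features")));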
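
For anyone unsure what the parsing step is expected to do with a value like "[1,4,5]", here is a rough plain-Java sketch of the dense case only. It is not Spark's implementation: the class DenseVectorText and the method parseDense are hypothetical names made up for illustration, and the sketch returns a double[] rather than an MLlib Vector. Inside the UDF you would still call Vectors.parse.

```java
import java.util.Arrays;

public class DenseVectorText {
    // Hypothetical helper (not part of Spark): turns "[1,4,5]" into {1.0, 4.0, 5.0}.
    static double[] parseDense(String text) {
        String s = text.trim();
        // Require the surrounding square brackets.
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            throw new IllegalArgumentException("Expected a bracketed list, got: " + text);
        }
        String body = s.substring(1, s.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        // Split on commas and parse each entry as a double.
        String[] parts = body.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [1.0, 4.0, 5.0]
        System.out.println(Arrays.toString(parseDense("[1,4,5]")));
    }
}
```

Note that because the value itself contains commas, it has to be quoted in the CSV file (as in the question) so the reader treats it as a single field.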

Converting CSV values to a Vector with a Spark DataFrame in Java
