Question

I have a CSV file with more than 9,000 records in the following format:

 id,Category1,Category2

How can I convert this CSV file to an RDD<Vector> so that I can use it with Apache Spark's columnSimilarities in Java to find similar columns?

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#RowMatrix-org.apache.spark.rdd.RDD-

Answer 1

From what I have read, a Vector can hold an ID and takes a double[] as its values. You need to populate the vectors.

List<String> lines = Files.readAllLines(Paths.get("myfile.csv"), Charset.defaultCharset());

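You can then iterate over the lines, create a Vector for each line, fill it with the values (you will need to parse them), and add the vectors to an RDD.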
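The parsing step described above can be sketched in plain Java. This is only a minimal sketch under stated assumptions: `CsvRows` and `parse` are hypothetical names (not part of Spark), every field is assumed to be numeric, and the first line is treated as a header when `hasHeader` is true. It stops at `double[]` rows so that it runs without Spark on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): turns CSV lines into numeric rows.
final class CsvRows {

    // Parses each non-empty line into a double[]; skips the first line when hasHeader is true.
    static List<double[]> parse(List<String> lines, boolean hasHeader) {
        List<double[]> rows = new ArrayList<>();
        for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length];
            for (int j = 0; j < fields.length; j++) {
                // Assumes every field is numeric; throws NumberFormatException otherwise.
                values[j] = Double.parseDouble(fields[j].trim());
            }
            rows.add(values);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to show the shape of the result.
        List<String> lines = Arrays.asList("id,Category1,Category2", "1,10,20", "2,30,40");
        for (double[] row : parse(lines, true)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With Spark available, each `double[]` could then be wrapped with `Vectors.dense(row)`, and the resulting `List<Vector>` passed to `JavaSparkContext.parallelize(...)` followed by `.rdd()` to obtain the `RDD<Vector>` that the `RowMatrix` constructor expects. Reading all 9,000 lines on the driver first should be fine at this size, but it would not scale to very large files.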

Answer 2

You could try the following:

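import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row

sparkSession.read.option("header", "true").option("inferSchema", "true") // assumes a header row and integer columns
  .csv(myCsvFilePath) // you should have a DataFrame here
  .rdd // RDD[Row]; mapping here avoids needing an Encoder for Vector
  .map((r: Row) => Vectors.dense(r.getInt(0).toDouble, r.getInt(1).toDouble, r.getInt(2).toDouble)) // you have your RDD[Vector]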

If that does not work, feel free to get back to us.

How to convert a .csv file to RDD<Vector>?
