Converting a Spark DataFrame to an MLlib Matrix

Time: 2018-04-08 21:07:29

Tags: scala apache-spark spark-dataframe apache-spark-mllib

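I need to build a contingency table in Scala Spark, and my attempt so far is shown below. To continue, I need to convert an org.apache.spark.sql.DataFrame to an org.apache.spark.mllib.linalg.Matrix. I have searched a lot, but I mostly found examples that go the other way, from a Matrix to a DataFrame. Thanks in advance for your help.

Here is my DataFrame: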

scala> val ff = df.stat.crosstab("firstAttr", "secondAttr")
scala> val myDf = ff.select("no", "yes")
myDf: org.apache.spark.sql.DataFrame = [no: bigint, yes: bigint]

scala> myDf.show()
+---+---+
| no|yes|
+---+---+
|332| 16|
|180| 13|
| 20|  3|
| 21|  3|
+---+---+

1 answer:

Answer 0 (score: 0)

Required imports:

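import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.sql.Row
// Needed for toDF and the $"col" syntax; already in scope in spark-shell.
import spark.implicits._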

Data:

val df = Seq((332, 16), (180, 13), (20, 3), (21, 3)).toDF("no", "yes")

Flatten and collect the results:

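val values = df
  // Matrices.dense needs Double values, so cast both count columns.
  .select($"no".cast("double"), $"yes".cast("double"))
  // Bind the fields in the same order as they were selected.
  .map { case Row(no: Double, yes: Double) => Seq(no, yes) }
  .collect
  .toSeq
  // Transpose so that flattening yields column-major order.
  .transpose
  .flatten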
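The `transpose` step matters because `Matrices.dense` expects its values in column-major order. The plain-Scala sketch below (no Spark required) shows the difference using the rows from the question. The object name `ColumnMajorSketch` and the helper `at` are illustrative only and are not part of any library.

```scala
object ColumnMajorSketch {
  // Rows of the contingency table as they would be collected: (no, yes).
  val rows: Seq[Seq[Double]] = Seq(
    Seq(332.0, 16.0),
    Seq(180.0, 13.0),
    Seq(20.0, 3.0),
    Seq(21.0, 3.0)
  )

  // Row-major flattening walks each row in turn: 332, 16, 180, 13, ...
  val rowMajor: Seq[Double] = rows.flatten

  // Column-major flattening walks each column in turn:
  // all "no" counts first, then all "yes" counts.
  val columnMajor: Seq[Double] = rows.transpose.flatten

  // Look up element (i, j) in a column-major sequence with numRows rows.
  def at(values: Seq[Double], numRows: Int, i: Int, j: Int): Double =
    values(j * numRows + i)
}
```

Passing the row-major sequence to a column-major constructor would still yield a 4 x 2 matrix, but with the counts in the wrong cells, which is an easy mistake to miss.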

Create the Matrix:

// Arguments: number of rows, number of columns, values in column-major order.
Matrices.dense(df.count.toInt, df.columns.size, values.toArray)
// res8: org.apache.spark.mllib.linalg.Matrix =
// 332.0  16.0
// 180.0  13.0
// 20.0   3.0
// 21.0   3.0
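If the conversion is needed for more than one DataFrame, the steps can be wrapped in a small helper. The sketch below is one possible way to do it, not an official Spark API: `MatrixConversions` and `toDenseMatrix` are hypothetical names. It assumes every column can be cast to double, that no value is null, that column names contain no dots, and that the data is small enough to collect to the driver, which is normally true for a contingency table.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object MatrixConversions {
  // Hypothetical helper: converts an all-numeric DataFrame to a dense MLlib Matrix.
  def toDenseMatrix(df: DataFrame): Matrix = {
    // Cast every column to double so that getDouble is safe below.
    val doubleCols = df.columns.map(name => col(name).cast("double"))

    // Collect once; the row count is taken from this result, so no
    // separate count job is run and the sizes cannot disagree.
    val rows: Array[Array[Double]] =
      df.select(doubleCols: _*).collect().map { row =>
        Array.tabulate(row.length)(i => row.getDouble(i))
      }

    val numRows = rows.length
    val numCols = df.columns.length

    // Fill the values in column-major order, as Matrices.dense expects:
    // index k maps to row k % numRows and column k / numRows.
    val values = Array.tabulate(numRows * numCols) { k =>
      rows(k % numRows)(k / numRows)
    }

    Matrices.dense(numRows, numCols, values)
  }
}
```

With the DataFrame from the question this would be called as `MatrixConversions.toDenseMatrix(myDf)`. Taking the row count from the collected array rather than from `df.count` avoids running a second Spark job.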