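I need to create a contingency table in Scala Spark. I tried to develop my code as shown below. I need to convert an org.apache.spark.sql.DataFrame to an org.apache.spark.mllib.linalg.Matrix. I have searched a lot, but mostly found examples that go the other way, from Matrix to DataFrame. Thanks in advance for your help.
Here is my DataFrame: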
scala> val ff = df.stat.crosstab("firstAttr", "secondAttr")
scala> val myDf = ff.select("no", "yes")
myDf: org.apache.spark.sql.DataFrame = [no: bigint, yes: bigint]
scala> myDf.show()
+---+---+
| no|yes|
+---+---+
|332| 16|
|180| 13|
| 20|  3|
| 21|  3|
+---+---+
Answer 0 (score: 0)
Required imports:
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.sql.Row
Data:
val df = Seq((332, 16), (180, 13), (20, 3), (21, 3)).toDF("no", "yes")
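Flatten and collect the result (Matrices.dense expects the values in column-major order, hence the transpose):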
val values = df
  .select($"no".cast("double"), $"yes".cast("double"))
  .map { case Row(no: Double, yes: Double) => Seq(no, yes) }  // one Seq per row
  .collect
  .toSeq
  .transpose  // rows -> columns
  .flatten    // columns laid end to end (column-major)
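The transpose-then-flatten step can be checked without Spark. The following is a minimal sketch in plain Scala, assuming the collected rows are the same four rows as in the data above; the helper `at` is hypothetical and only illustrates how a column-major array is indexed, it is not part of the Spark API.

```scala
// Rows as they would come back from collect (same numbers as the data above).
val rows: Seq[Seq[Double]] = Seq(
  Seq(332.0, 16.0),
  Seq(180.0, 13.0),
  Seq(20.0, 3.0),
  Seq(21.0, 3.0)
)

// transpose turns 4 rows of 2 values into 2 columns of 4 values;
// flatten then lays the columns out one after another (column-major).
val columnMajor: Array[Double] = rows.transpose.flatten.toArray
// columnMajor: Array(332.0, 180.0, 20.0, 21.0, 16.0, 13.0, 3.0, 3.0)

// Hypothetical helper (not part of Spark): element (i, j) of a
// column-major array with numRows rows sits at index i + j * numRows.
def at(values: Array[Double], numRows: Int)(i: Int, j: Int): Double =
  values(i + j * numRows)
```

Without the transpose, the array would be in row-major order and the resulting matrix would have its entries in the wrong positions.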
Create the Matrix:
Matrices.dense(df.count.toInt, df.columns.size, values.toArray)
// res8: org.apache.spark.mllib.linalg.Matrix =
// 332.0 16.0
// 180.0 13.0
// 20.0 3.0
// 21.0 3.0