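Converting the result of columnSimilarities() back to a Spark DataFrame

Asked: 2017-02-25 11:54:04

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-mllib

I have a second question about CosineSimilarity / columnSimilarities in Spark 2.1. I'm new to Scala and the whole Spark environment, and this isn't clear to me:

How can I get the column similarities back for each combination of columns from a RowMatrix in Spark? Here is what I tried:

Data: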

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._

// rdd
    val rowsRdd: RDD[Row] = sc.parallelize(
      Seq(
        Row(2.0, 7.0, 1.0),
        Row(3.5, 2.5, 0.0),
        Row(7.0, 5.9, 0.0)
      )
    )

// Schema  
    val schema = new StructType()
      .add(StructField("item_1", DoubleType, true))
      .add(StructField("item_2", DoubleType, true))
      .add(StructField("item_3", DoubleType, true))

// Data frame  
    val df = spark.createDataFrame(rowsRdd, schema) 

Code:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}

val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd

val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
                             .map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()


println("Pairwise similarities are: " +   simsPerfect.entries.collect.mkString(", "))

Output:

Pairwise similarities are: MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)
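
As a sanity check, these values can be reproduced without Spark. The following is a minimal sketch in plain Scala, assuming the usual definition of cosine similarity (dot product divided by the product of the Euclidean norms). The helper name `cosine` is hypothetical and not part of any Spark API.

```scala
// Columns of the example data, written out by hand.
val item1 = Array(2.0, 3.5, 7.0)
val item2 = Array(7.0, 2.5, 5.9)
val item3 = Array(1.0, 0.0, 0.0)

// Hypothetical helper: cosine similarity of two equally sized vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}

// Same pairs as MatrixEntry(0,1,...), MatrixEntry(0,2,...), MatrixEntry(1,2,...).
println("item_1 vs item_2: " + cosine(item1, item2))
println("item_1 vs item_3: " + cosine(item1, item3))
println("item_2 vs item_3: " + cosine(item2, item3))
```

The printed values should agree with the MatrixEntry values above up to floating-point rounding. This also shows that the indices in each MatrixEntry are 0-based column positions.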

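So what I get back is simsPerfect, an org.apache.spark.mllib.linalg.distributed.CoordinateMatrix holding my column pairs and their similarities. How do I convert it back to a DataFrame and get the correct column names with it?

My preferred output: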

    item_from | item_to | similarity
            1 |       2 |       0.83
            1 |       3 |       0.24
            2 |       3 |       0.73

Thanks in advance.

2 answers:

Answer 0 (score: 3)

This approach also works, without converting the rows to String:

val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => (row,col,sim)}
val dff = sqlContext.createDataFrame(transformedRDD).toDF("item_from", "item_to", "sim")

Here I assume that val sqlContext = new org.apache.spark.sql.SQLContext(sc) has already been defined, where sc is the SparkContext.
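
The question also asks for the column names, which the snippet above leaves as numeric indices. One possible way to attach them is sketched below. It assumes the MatrixEntry indices follow the order of the assembled input columns, and it creates a local SparkSession purely for illustration (in spark-shell, `spark` already exists). The names `columnNames` and `named` are hypothetical; in the question's setting, `df.columns` would supply the same array.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("column-similarities-names")
  .getOrCreate()
import spark.implicits._

// Same data and column names as in the question.
val columnNames = Array("item_1", "item_2", "item_3")
val mat = new RowMatrix(spark.sparkContext.parallelize(Seq(
  Vectors.dense(2.0, 7.0, 1.0),
  Vectors.dense(3.5, 2.5, 0.0),
  Vectors.dense(7.0, 5.9, 0.0)
)))
val simsPerfect = mat.columnSimilarities()

// MatrixEntry.i and MatrixEntry.j are 0-based Long column indices;
// look up the matching name for each of them.
val named = simsPerfect.entries
  .map(e => (columnNames(e.i.toInt), columnNames(e.j.toInt), e.value))
  .toDF("item_from", "item_to", "similarity")

named.orderBy("item_from", "item_to").show()
```

The lookup happens inside the RDD map, so the small `columnNames` array is shipped to the executors with the closure. That is fine for a handful of columns.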

Answer 1 (score: 0)

I found a solution to my problem:

//Transform result to rdd
val transformedRDD = simsPerfect.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}

//Transform rdd[String] to rdd[Row]
val rdd2 = transformedRDD.map(a => Row(a))

// to DF
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = spark.createDataFrame(rdd2,dfschema) 

//create new DF with schema
val newdf = rddToDF.select(expr("(split(value, ','))[0]").cast("string").as("item_from")
              ,expr("(split(value, ','))[1]").cast("string").as("item_to")
              ,expr("(split(value, ','))[2]").cast("string").as("sim"))

I'm sure there is another, simpler way to do this, but I'm glad it works.
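
One candidate for such a simpler way is sketched here. It relies on MatrixEntry being a case class with fields i, j and value, so the entries RDD can be passed to createDataFrame directly and the columns keep their numeric types. The local SparkSession and the stand-in `simsPerfect`, rebuilt from the entries printed in the question, are assumptions made only to keep the example self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("entries-to-dataframe")
  .getOrCreate()

// Stand-in for simsPerfect, built from the entries printed in the question.
val simsPerfect = new CoordinateMatrix(spark.sparkContext.parallelize(Seq(
  MatrixEntry(0, 2, 0.24759378423606918),
  MatrixEntry(1, 2, 0.7376189553526812),
  MatrixEntry(0, 1, 0.8355316482961213)
)))

// The case class fields become the columns i, j and value.
val entriesDf = spark.createDataFrame(simsPerfect.entries)

// Shift the 0-based indices by one to match the 1-based numbering
// used in the preferred output of the question.
val result = entriesDf.select(
  (col("i") + 1).as("item_from"),
  (col("j") + 1).as("item_to"),
  col("value").as("similarity")
).orderBy("item_from", "item_to")

result.show()
```

Unlike the string-splitting version, this keeps item_from and item_to as integers and similarity as a double, so the result can be filtered or sorted numerically without further casts.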