Spark: mapping a DataFrame to arrays

Date: 2018-03-23 14:04:54

Tags: apache-spark spark-dataframe

I'm using Spark MLlib's PrefixSpan algorithm. I had code that ran on Spark 1.6, but we recently moved to Spark 2.2.

I have a DataFrame like this:

viewsPurchasesGrouped: org.apache.spark.sql.DataFrame = [session_id: decimal(29,0), view_product_ids: array<bigint> ... 1 more field]

root
 |-- session_id: decimal(29,0) (nullable = true)
 |-- view_product_ids: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- purchase_product_ids: array (nullable = true)
 |    |-- element: long (containsNull = true)

In Spark 1.6, I used this code to convert it into the structure MLlib consumes:

import scala.collection.mutable.WrappedArray

val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    Array(row.getAs[WrappedArray[String]](1).toArray), 
    Array(row.getAs[WrappedArray[String]](2).toArray)
  )
)

Since our upgrade, this no longer works.

I tried this:

val viewsPurchasesRddString2 = viewsPurchasesGrouped.select("view_product_ids","purchase_product_ids").rdd.map( row =>
  Array(
    row.getSeq[Long](0).toArray, 
    row.getSeq[Long](1).toArray
  )
) 

and got this puzzling error message, which suggests it picked up session_id and purchase_product_ids instead of view_product_ids and purchase_product_ids from the original DataFrame:

Job aborted due to stage failure: [...] scala.MatchError: [14545234113341303814564569524,WrappedArray(123, 234, 456, 678, 789)]

I also tried this:

val viewsPurchasesRddString = viewsPurchasesGrouped.map {
   case Row(session_id: Long, view_product_ids: Array[Long], purchase_product_ids: Array[Long]) => 
     (view_product_ids, purchase_product_ids)
}

which fails with:
viewsPurchasesRddString: org.apache.spark.sql.Dataset[(Array[Long], Array[Long])] = [_1: array<bigint>, _2: array<bigint>]
prefixSpan: org.apache.spark.mllib.fpm.PrefixSpan = org.apache.spark.mllib.fpm.PrefixSpan@10d69876
<console>:67: error: overloaded method value run with alternatives:
  [Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
  [Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: 
scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] cannot be applied to (org.apache.spark.sql.Dataset[(Array[Long], Array[Long])])
   val model = prefixSpan.run(viewsPurchasesRddString)
                          ^
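The overloaded-method error at the bottom is about shape: prefixSpan.run wants an RDD[Array[Array[Item]]] — each element is one sequence, i.e. an array of itemsets — not a Dataset of tuples. A plain-Scala sketch of the per-row shape it expects (no Spark involved; the product ids are invented for illustration):

```scala
// Invented stand-ins for one row's two array<bigint> columns.
val viewProductIds: Array[Long] = Array(123L, 234L, 456L)
val purchaseProductIds: Array[Long] = Array(789L)

// One PrefixSpan sequence: an Array of itemsets, here the views as one
// itemset followed by the purchases as another (type Array[Array[Long]],
// so Item is inferred as Long).
val sequence: Array[Array[Long]] = Array(viewProductIds, purchaseProductIds)
```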

How do I port my code correctly?

1 Answer:

Answer 0 (score: 1)

Your DataFrame shows that these columns have type array<bigint>, so you should read their elements as Long, not String. In Spark 1.6, map on a DataFrame automatically dropped down to the RDD API; in Spark 2 you have to call rdd.map explicitly to get the same behavior. So I'd suggest this should work:

import scala.collection.mutable.WrappedArray

val viewsPurchasesRddString = viewsPurchasesGrouped.rdd.map( row =>
  Array(
    Array(row.getAs[WrappedArray[Long]](1).toArray),
    Array(row.getAs[WrappedArray[Long]](2).toArray)
  )
)
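Since .rdd.map just applies the same function to every Row, the per-row shaping can be sanity-checked on ordinary Scala collections. A sketch with invented session data, using the flatter Array(views, purchases) shape (Item = Long) rather than the extra Array wrapping above — which nesting PrefixSpan should receive depends on how you want itemsets grouped:

```scala
// Invented stand-ins for the Seq[Long] values Spark hands back for
// array<bigint> columns, one tuple per session.
val rows: List[(Seq[Long], Seq[Long])] = List(
  (Seq(123L, 234L), Seq(234L)),
  (Seq(456L, 678L, 789L), Seq.empty[Long])
)

// Each row becomes one sequence of two itemsets: views, then purchases.
val sequences: List[Array[Array[Long]]] = rows.map { case (views, purchases) =>
  Array(views.toArray, purchases.toArray)
}
```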