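I have been trying to convert an RDD to a DataFrame. For that, the column types need to be defined rather than being Any. I am using Spark MLlib's PrefixSpan, which is where freqSequence.sequence comes from. I start with a DataFrame that contains session_id, views and purchases, the latter two as string arrays: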
viewsPurchasesGrouped: org.apache.spark.sql.DataFrame =
[session_id: decimal(29,0), view_product_ids: array[string], purchase_product_ids: array[string]]
Then I compute the frequent patterns and need them in a DataFrame so that I can write them to a Hive table.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row => Array(Array(row(1)), Array(row(2)) ))
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.001)
  .setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
val freqSequencesRdd = sc.parallelize(model.freqSequences.collect())
case class FreqSequences(views: Array[String], purchases: Array[String], support: Long)
val viewsPurchasesDf = freqSequencesRdd.map( fs =>
  {
    val views = fs.sequence(0)(0)
    val purchases = fs.sequence(1)(0)
    val freq = fs.freq
    FreqSequences(views, purchases, freq)
  }
)
viewsPurchasesDf.toDF() // optional
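When I try to run this, views and purchases are Any instead of Array[String]. I tried desperately to convert them, but the best I got was Array[Any]. I think I need to map the contents to String. I tried, for example, this: How to get an element in WrappedArray: result of Dataset.select("x").collect()? and this: How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala), as well as thousands of other Stack Overflow questions...
I really don't know how to solve this. I suspect I have converted the initial DataFrame/RDD too many times, but I can't figure out where.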
Answer 0 (score: 1)
I solved the problem. For reference, this works:
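import org.apache.spark.mllib.fpm.PrefixSpan

// Read each array column with an explicit element type instead of row(i), which returns Any.
// Long is the element type used in this answer; use the type your columns actually hold.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    row.getSeq[Long](1).toArray,
    row.getSeq[Long](2).toArray
  )
)
val prefixSpan = new PrefixSpan()
  .setMinSupport(0.001)
  .setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
case class FreqSequences(views: Long, purchases: Long, frequence: Long)
// Keep only patterns with at least two itemsets, so that sequence(1) exists.
val ps_frequences = model.freqSequences.filter(fs => fs.sequence.length > 1).map( fs =>
  {
    val views = fs.sequence(0)(0)
    val purchases = fs.sequence(1)(0)
    val freq = fs.freq
    FreqSequences(views, purchases, freq)
  }
)
ps_frequences.toDF()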
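The filter on fs.sequence.length > 1 is presumably there because the mined patterns can also consist of a single itemset, in which case fs.sequence(1) would fail. The following is only a sketch of that step in plain Scala without Spark: FreqSeq, ViewPurchase and the sample values are hypothetical stand-ins, not MLlib types or real results.

```scala
// Hypothetical stand-in for an MLlib frequent sequence: itemsets plus a count.
case class FreqSeq(sequence: Array[Array[Long]], freq: Long)

// Hypothetical target row: first item of the first and second itemset.
case class ViewPurchase(view: Long, purchase: Long, frequency: Long)

// Made-up sample patterns for illustration only.
val found: Seq[FreqSeq] = Seq(
  FreqSeq(Array(Array(10L)), 5L),            // one itemset only: no sequence(1)
  FreqSeq(Array(Array(10L), Array(20L)), 3L) // two itemsets: view, then purchase
)

// Drop single-itemset patterns first, then index into the remaining ones safely.
val pairs: Seq[ViewPurchase] = found
  .filter(_.sequence.length > 1)
  .map(fs => ViewPurchase(fs.sequence(0)(0), fs.sequence(1)(0), fs.freq))
```

Without the filter, the first sample pattern would throw an ArrayIndexOutOfBoundsException when sequence(1) is accessed.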
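Answer 1 (score: 0)
I think the problem is that you have a DataFrame, which does not retain static type information. When you take an item out of a Row, you have to tell it explicitly which type you expect to get.
Untested, but inferred from the information you provided: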
import scala.collection.mutable.WrappedArray
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    Array(row.getAs[WrappedArray[String]](1).toArray),
    Array(row.getAs[WrappedArray[String]](2).toArray)
  )
)
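To illustrate why the explicit type matters, here is a minimal sketch in plain Scala that needs no Spark. The Seq[Any] value is a hypothetical stand-in for a Row, and getStringArray is an illustrative helper, not a Spark API; on a real Row, the typed accessors getAs[T] and getSeq[T] play that role.

```scala
// Stand-in for a Spark Row: values are stored as Any, so element types are lost.
// (Hypothetical data; a real Row would come from a DataFrame.)
val row: Seq[Any] = Seq(BigDecimal(1), Seq("a", "b"), Seq("c"))

// Untyped access, like row(1): the static type is Any, so Array(row(1)) is Array[Any].
val untyped: Any = row(1)

// Typed access (illustrative helper): state the expected type, then convert to an array.
def getStringArray(r: Seq[Any], i: Int): Array[String] =
  r(i).asInstanceOf[Seq[String]].toArray

val views: Array[String] = getStringArray(row, 1)
val purchases: Array[String] = getStringArray(row, 2)

// One input sequence in the shape PrefixSpan takes: an array of itemsets.
val sequence: Array[Array[String]] = Array(views, purchases)
```

Because the cast is unchecked at the element level, a wrong type argument only fails later, when an element is actually used as a String.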