FP-Growth model in Spark

Date: 2018-08-12 01:03:53

Tags: scala apache-spark machine-learning apache-spark-mllib

I am trying to run the FP-Growth algorithm in Spark using Spark 2.2 MLlib with the following code:

val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
val model = fpgrowth.fit(dataset1)

where dataset1 is produced by the following SQL query:

select items from MLtable

The items column of this table looks like this:

"NFL Cricket MLB Unknown1 Unknown2 Unknown Unknown Unknown",
"Unknown Unknown Unknown Unknown Unknown CCC DDD RRR",
"Unknown Unknown Unknown Unknown CFB Unknown Unknown Unknown",
"Unknown Cricket Unknown Unknown Unknown Unknown Unknown Unknown",
"NFL Unknown MLB NBA CFB Unknown Unknown Unknown"

Whenever I try to fit the model, I get the following error:

Items in a transaction must be unique but got WrappedArray(…)

I have tried several times but keep hitting this error. Any help is much appreciated.

1 Answer:

Answer 0 (score: 2)

The error message is telling you that the items within each transaction must be unique, and your transactions contain repeated items (for example, Unknown appears several times in a single row). De-duplicate each transaction before fitting:

import org.apache.spark.sql.functions.{split, udf}
import spark.implicits._  // needed for toDF and the $ column syntax

val df = Seq(
  "NFL Cricket MLB Unknown1 Unknown2 Unknown Unknown Unknown",
  "Unknown Unknown Unknown Unknown Unknown CCC DDD RRR",
  "Unknown Unknown Unknown Unknown CFB Unknown Unknown Unknown",
  "Unknown Cricket Unknown Unknown Unknown Unknown Unknown Unknown",
  "NFL Unknown MLB NBA CFB Unknown Unknown Unknown"
).toDF("items")

// UDF that keeps only the first occurrence of each item in a transaction
val distinct = udf((xs: Seq[String]) => xs.distinct)

val items = df
  .withColumn("items", split($"items", "\\s+"))
  // Keep only distinct values
  .withColumn("items", distinct($"items"))


new FPGrowth().fit(items).freqItemsets.show
// +-------------------+----+
// |              items|freq|
// +-------------------+----+
// |              [MLB]|   2|
// |         [MLB, NFL]|   2|
// |[MLB, NFL, Unknown]|   2|
// |     [MLB, Unknown]|   2|
// |          [Unknown]|   5|
// |              [NFL]|   2|
// |     [NFL, Unknown]|   2|
// |          [Cricket]|   2|
// | [Cricket, Unknown]|   2|
// |              [CFB]|   2|
// |     [CFB, Unknown]|   2|
// +-------------------+----+
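The udf above is just a thin wrapper around Scala's own Seq#distinct, which is what removes the duplicates before FPGrowth sees the transaction. A minimal sketch of that step on one of the sample rows (plain Scala, no Spark required):

```scala
object DistinctSketch {
  def main(args: Array[String]): Unit = {
    // One raw transaction from the sample data
    val raw = "Unknown Unknown Unknown Unknown CFB Unknown Unknown Unknown"

    // Same tokenization that the split($"items", "\\s+") column expression performs
    val tokens: Seq[String] = raw.split("\\s+").toSeq

    // Seq#distinct keeps the first occurrence of each item, preserving order
    val unique: Seq[String] = tokens.distinct

    println(unique.mkString(" "))
  }
}
```

As a side note, if you are able to upgrade, Spark 2.4+ ships a built-in array_distinct SQL function that could replace the udf; on Spark 2.2, as in the question, the udf approach is needed.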