How to get string values from an RDD while implementing Spark FP-Growth?

Time: 2017-03-27 23:34:28

Tags: scala apache-spark-mllib

I have successfully joined userID and match in the query below:

var queryToGroupCustomers = "SELECT yt.userID as player," +
  " concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
  " FROM recommendationengine.sportsbookbets_orc yt" +
  " where yt.userID is not null " + leagueCondition + "'" +
  " GROUP BY yt.userID"
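For intuition, `concat_ws(",", collect_set(match))` deduplicates the match values within each userID group and joins them into one comma-separated string. A plain-Scala sketch of that behavior (the sample match values are hypothetical; note that unlike this sketch, collect_set does not guarantee element order):

```scala
object ConcatWsDemo {
  // Hypothetical bet rows for a single userID
  val matches: Seq[String] =
    Seq("Napoli - Crotone", "AC Milan - Juventus", "Napoli - Crotone")

  // collect_set -> deduplicate; concat_ws(",", ...) -> join with commas
  val matchesPlayedOn: String = matches.distinct.mkString(",")

  def main(args: Array[String]): Unit = println(matchesPlayedOn)
}
```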

Now I want to pass the column into an RDD for the algorithm. My implementation uses the generic Row format:

val transactions: RDD[Array[String]] = results.rdd.map( row => row.get(2).toString.split(","))

but it gives the following error:

17/03/27 23:28:51 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 29)
java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:200)

Here is a sample row from the joined dataset:

ff6e96d4-e243-4046-8e02-ce3d4b459a5d    Napoli - Crotone, AC Milan - Juventus, Torino - Juventus, AS Roma - AC Milan, Empoli - Bologna, AC Milan - Internazionale, Genoa - AC Milan, Sassuolo - Chievo Verona, Sassuolo - Genoa
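As a plain-Scala illustration (no Spark needed), splitting the matchesPlayedOn field of such a row on commas yields the per-customer item array that FP-Growth consumes; the values below are a shortened version of the sample row above:

```scala
object SplitDemo {
  // The second field of the sample row: a comma-separated list of matches
  val matchesPlayedOn: String =
    "Napoli - Crotone, AC Milan - Juventus, Torino - Juventus"

  // Split on "," and trim, mirroring what the RDD map should do per row
  val items: Array[String] = matchesPlayedOn.split(",").map(_.trim)

  def main(args: Array[String]): Unit = items.foreach(println)
}
```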

My full implementation of the algorithm is shown below:

// Has all customers and their bets
var queryToGroupCustomers = "SELECT yt.userID as player," +
  " concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
  " FROM recommendationengine.sportsbookbets_orc yt" +
  " where yt.userID is not null " + leagueCondition + "'" +
  " GROUP BY yt.userID"

println("Executing query: \n\n" + queryToGroupCustomers)
var results = hc.sql(queryToGroupCustomers).cache()
val transactions: RDD[Array[String]] = results.rdd.map( row => row.get(2).toString.split(","))

// Set configurations for FP-Growth
val fpg = new FPGrowth()
  .setMinSupport(0.5)
  .setNumPartitions(10)

// Generate model
val model = fpg.run(transactions);

println("\n\n Starting FPGrowth\n\n")

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

I would appreciate any suggestions... thanks

1 Answer:

Answer 0 (score: 0)

Your query SELECTs two columns, so each Row has 2 fields, and row.get(2) tries to fetch the value of its third field (fields in a Row are zero-indexed); hence the ArrayIndexOutOfBoundsException. To get matchesPlayedOn, use row.get(1) or simply row(1).
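A minimal sketch of the fix in plain Scala, modeling the Row as a Seq[Any] since Spark is not needed to show the indexing (the field values below are hypothetical, mirroring the query's player and matchesPlayedOn columns):

```scala
object RowIndexDemo {
  // Stand-in for org.apache.spark.sql.Row: the query SELECTs two columns,
  // so the Row carries exactly two fields, at indices 0 and 1
  val row: Seq[Any] = Seq(
    "ff6e96d4-e243-4046-8e02-ce3d4b459a5d",  // index 0: player (userID)
    "Napoli - Crotone, AC Milan - Juventus"  // index 1: matchesPlayedOn
  )

  // row(2) would throw ArrayIndexOutOfBoundsException, as in the error log;
  // row(1) is the comma-separated matches column we actually want
  val transactions: Array[String] = row(1).toString.split(",").map(_.trim)

  def main(args: Array[String]): Unit = transactions.foreach(println)
}
```

In the Spark code itself, this corresponds to changing the map to `row => row.get(1).toString.split(",")`.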