I have successfully joined userID with match in the query below.
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
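Now I want to pass the matchesPlayedOn column to an RDD for use in an algorithm. My implementation uses the generic Row format:
val transactions: RDD[Array[String]] = results.rdd.map( row => row.get(2).toString.split(","))
But it fails with the following error: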
17/03/27 23:28:51 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 29)
java.lang.ArrayIndexOutOfBoundsException: 2
at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:200)
Here is a sample of the joined dataset:
ff6e96d4-e243-4046-8e02-ce3d4b459a5d Napoli - Crotone, AC Milan - Juventus, Torino - Juventus, AS Roma - AC Milan, Empoli - Bologna, AC Milan - Internazionale, Genoa - AC Milan, Sassuolo - Chievo Verona, Sassuolo - Genoa
My full implementation of the algorithm so far is shown below:
// Has all customers and their bets
var queryToGroupCustomers = "SELECT yt.userID as player," +
" concat_ws(\",\", collect_set(match)) AS matchesPlayedOn" + //concat_ws()
" FROM recommendationengine.sportsbookbets_orc yt" +
" where yt.userID is not null " + leagueCondition + "'" +
" GROUP BY yt.userID"
println("Executing query: \n\n" + queryToGroupCustomers)
var results = hc.sql(queryToGroupCustomers).cache()
val transactions: RDD[Array[String]] = results.rdd.map( row => row.get(2).toString.split(","))
// Set configurations for FP-Growth
val fpg = new FPGrowth()
.setMinSupport(0.5)
.setNumPartitions(10)
// Generate model
val model = fpg.run(transactions);
println("\n\n Starting FPGrowth\n\n")
model.freqItemsets.collect().foreach { itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
I would appreciate any suggestions. Thanks.
Answer 0 (score: 0)
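Your Row has only 2 fields (player and matchesPlayedOn), and row.get(2) asks for the value of a third field, because fields in a Row are indexed from 0, as usual. That is why you get ArrayIndexOutOfBoundsException: 2. To get matchesPlayedOn, use row.get(1) or row(1).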
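As a minimal sketch of the fix, assuming spark-sql is on the classpath: the object name RowIndexSketch, the helper toTransaction, and the sample row are hypothetical and only illustrate the indexing. The trim call is also an assumption, added because the sample output shows a space after each comma.

```scala
import org.apache.spark.sql.Row

object RowIndexSketch {
  // Turn one (player, matchesPlayedOn) row into an FP-Growth transaction.
  // Index 0 is player and index 1 is matchesPlayedOn; there is no index 2.
  def toTransaction(row: Row): Array[String] =
    row.getString(1).split(",").map(_.trim) // trim is an assumption, see above

  def main(args: Array[String]): Unit = {
    // Hypothetical row shaped like the two-column query result.
    val row = Row(
      "ff6e96d4-e243-4046-8e02-ce3d4b459a5d",
      "Napoli - Crotone,AC Milan - Juventus")

    println(row.length) // number of fields in the row
    println(toTransaction(row).mkString("[", ",", "]"))
  }
}
```

In the question's code this would correspond to results.rdd.map(row => row.getString(1).split(",")). On rows that come from a DataFrame and therefore carry a schema, row.getAs[String]("matchesPlayedOn") looks the field up by name, which avoids depending on column order.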