I'm working on a project with Apache Spark. I have a dataset of Amazon product reviews; each element has fields like user ID, product ID, score, helpfulness and so on, which I don't think really matter for this question.
First, I have to create an RDD containing tuples relative to a specific productId; in particular, the final helpfulness is not just the helpfulness the user received on that review, but also takes into account the average from the other users' reviews.
Then, for each user, I want to compute the average final helpfulness over all products. The function that computes the result for a single product is pageRankOneMovie. I thought the solution was a flatMap over the collection of productIds, like this:
val userHelpfulnessRankings = moviesProductId.flatMap(pageRankOneMovie(movies, _).collect.toList)
However, I get error SPARK-5063, because calling pageRankOneMovie inside the flatMap nests one RDD transformation (and a collect) inside another.
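For reference, the same error can be reproduced with a minimal self-contained sketch (illustrative only; sc is a SparkContext):

    val a = sc.parallelize(1 to 3)
    val b = sc.parallelize(4 to 6)
    // Fails at runtime with the SPARK-5063 message: "RDD transformations and
    // actions can only be invoked by the driver, not inside of other transformations"
    val nested = a.map(x => b.map(_ + x).count())
    nested.collect()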
I have read up a bit on broadcast variables and accumulators, and I think I could build something workable with them. However, I'd like to know whether there is a solution tailored to my problem, because it seems really simple to me: I need to programmatically create a series of RDDs and then union them together.
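What I have in mind is roughly this driver-side version (a sketch, assuming the set of distinct product ids is small enough to collect on the driver):

    val ids = moviesProductId.collect()
    // One RDD per product, built on the driver, then merged with SparkContext.union
    val perMovie = ids.map(id => pageRankOneMovie(movies, id))
    val userHelpfulnessRankings = context.union(perMovie.toSeq)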
For reference, this is the program I'm trying to run (it compiles fine, but raises the 5063 error at runtime):
object PageRank {
  // Final helpfulness of every user who reviewed one given product: a user's
  // helpfulness is raised towards the product's average when it falls below it.
  def pageRankOneMovie(movies: RDD[Movie], productId: String): RDD[(String, Double)] = {
    // (userId, helpfulness), keeping only users with a defined helpfulness
    val helpfulness = userHelpfulness(movies)
      .filter { case (_, value) => !value.isEmpty }
      .mapValues { _.get }
    // (score, average helpfulness) for this product
    val average = helpfulnessByScore(movies, productId)
    val reviews = movies.filter(_.productId == productId).map(mov => (mov.userId, mov.score))
    val reviewHelpfulness = reviews.join(helpfulness).map { case (id, (score, help)) => (score, (id, help)) }
    reviewHelpfulness.join(average).map {
      case (score, ((id, help), averageHelpfulness)) =>
        (id, if (help < averageHelpfulness) (help + averageHelpfulness) / 2 else help)
    }
  }

  def compute(movies: RDD[Movie], context: SparkContext): RDD[(String, Double)] = {
    val moviesProductId = movies.map(_.productId).distinct
    // This is the line that fails with SPARK-5063: pageRankOneMovie transforms
    // the movies RDD (and calls collect) inside a transformation of moviesProductId
    val userHelpfulnessRankings = moviesProductId.flatMap(pageRankOneMovie(movies, _).collect.toList)
    // Average every user's final helpfulness over all products
    val average = userHelpfulnessRankings
      .aggregateByKey((0.0, 0))(
        (acc, value) => (acc._1 + value, acc._2 + 1),
        (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
    average.map { case (userId, acc) => (userId, acc._1 / acc._2) }
  }
}
Answer 0 (score: 0)
OK, it seems there is no general solution to this problem. Apparently there are only two ways out of this situation: either collect the results and drive a for loop from there, or manage the whole computation in a single sequence of transformations. Since the first solution requires collecting a potentially large amount of data from the workers to the driver, I went with the second idea.
Basically, instead of isolating a single productId from the start, I track many movies at once by using (score, productId) tuples as keys. The final function is the following:
def pageRankAllMovies(movies: RDD[Movie]) = {
  // Average helpfulness per user
  // (userId, helpfulness (between 0 and 1))
  val helpfulness = userHelpfulness(movies)
    .filter { case (_, value) => !value.isEmpty }
    .mapValues { _.get }
  // Average review helpfulness per movie, grouped by assigned score:
  // ((score, productId), helpfulness) for each single productId
  val average = helpfulnessByScore(movies)
  val reviews = movies.map(mov => (mov.userId, (mov.score, mov.productId)))
  val reviewHelpfulness = reviews.join(helpfulness).map { case (id, (score, help)) => (score, (id, help)) }
  // For every "group" of reviews of the same movie with the same score,
  // raise the users' helpfulness towards the movie's average
  val globalUserHelpfulness = reviewHelpfulness.join(average).map {
    case (score, ((id, help), averageHelpfulness)) =>
      (id, if (help < averageHelpfulness) (help + averageHelpfulness) / 2 else help)
  }
  // With more than one movie there are several helpfulness values per user
  // at the end: average them
  globalUserHelpfulness
    .aggregateByKey((0.0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),
      (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
    .map { case (userId, help) => (userId, help._1 / help._2) }
}
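For completeness, a hypothetical call site (the Movie case class and the loading of the movies RDD are assumed, not shown here):

    val rankings: RDD[(String, Double)] = pageRankAllMovies(movies)
    // (userId, average final helpfulness across all reviewed products)
    rankings.take(10).foreach(println)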
tl;dr: either collect the results and loop over them, or manage the whole computation in a single sequence of transformations.