Associating posts with tags using Spark Scala

Time: 2017-03-12 16:21:23

Tags: scala apache-spark rdd

How can I associate blog posts with tags in Spark?

val posts = Seq("BMW is a good car", 
"AUDI beats Tesla on speed race", 
"BMW exposes its new vehicle at Montreal", 
"Mercedes introduces beast offroad track")

val rdd = sc.makeRDD(posts)

val tags = Seq("BMW", "AUDI", "Mercedes")

Based on the data above, I would like to obtain a new RDD[(String, Iterable[String])]:

("BMW", Iterable("BMW is a good car", "BMW exposes its new vehicle at Montreal"))

("AUDI", Iterable("AUDI beats Tesla on speed race"))

("Mercedes", Iterable("Mercedes introduces beast offroad track"))

Any ideas on how to do this?

1 Answer:

Answer 0: (score: 2)

// broadcast the tags
val tags_broadcast = sc.broadcast(tags)

// extract the tags each string contains in the rdd, make a pair rdd where the first element 
// is the tag and second element is the string, then call groupByKey method
rdd.flatMap(s => tags_broadcast.value.filter(s.contains(_)).map((_, s))).groupByKey.collect

// res110: Array[(String, Iterable[String])] = Array((AUDI,CompactBuffer(AUDI beats Tesla on speed race)), (BMW,CompactBuffer(BMW is a good car, BMW exposes its new vehicle at Montreal)), (Mercedes,CompactBuffer(Mercedes introduces beast offroad track)))
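
To check the matching logic without a cluster, the same flatMap-then-group idea can be sketched on plain Scala collections. This is only an illustration under assumptions: the object name `TagPostsSketch` and the helper `tagPosts` are hypothetical, and `groupBy` on a local `Seq` stands in for the RDD's `groupByKey`.

```scala
// Plain-collections sketch of the flatMap + group-by-tag logic (no Spark required).
// TagPostsSketch and tagPosts are hypothetical names used only for illustration.
object TagPostsSketch {
  val posts: Seq[String] = Seq(
    "BMW is a good car",
    "AUDI beats Tesla on speed race",
    "BMW exposes its new vehicle at Montreal",
    "Mercedes introduces beast offroad track")

  val tags: Seq[String] = Seq("BMW", "AUDI", "Mercedes")

  // Emit one (tag, post) pair for every tag that occurs in the post as a
  // substring, then group the pairs by tag and keep only the posts.
  def tagPosts(posts: Seq[String], tags: Seq[String]): Map[String, Seq[String]] =
    posts
      .flatMap(post => tags.filter(tag => post.contains(tag)).map(tag => (tag, post)))
      .groupBy(_._1)
      .map { case (tag, pairs) => (tag, pairs.map(_._2)) }

  def main(args: Array[String]): Unit =
    tagPosts(posts, tags).foreach { case (tag, ps) => println(s"$tag -> $ps") }
}
```

Note that `contains` is a case-sensitive substring test, so a post mentioning "bmw" in lowercase would not be matched to the "BMW" tag, and a post that matches no tag is dropped from the result.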