How can I associate blog posts with tags in Spark?
val posts = Seq("BMW is a good car",
"AUDI beats Tesla on speed race",
"BMW exposes its new vehicle at Montreal",
"Mercedes introduces beast offroad track")
val rdd = sc.makeRDD(posts)
val tags = Seq("BMW", "AUDI", "Mercedes")
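Based on the data above, I want to get a new RDD[(String, Iterable[String])]:
("BMW", Iterable("BMW is a good car", "BMW exposes its new vehicle at Montreal"))
("AUDI", Iterable("AUDI beats Tesla on speed race"))
("Mercedes", Iterable("Mercedes introduces beast offroad track"))
Any ideas on how to do this?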
Answer 0 (score: 2)
// broadcast the tags
val tags_broadcast = sc.broadcast(tags)
// extract the tags each string contains in the rdd, make a pair rdd where the first element
// is the tag and second element is the string, then call groupByKey method
rdd.flatMap(s => tags_broadcast.value.filter(s.contains(_)).map((_, s))).groupByKey.collect
// res110: Array[(String, Iterable[String])] = Array((AUDI,CompactBuffer(AUDI beats Tesla on speed race)), (BMW,CompactBuffer(BMW is a good car, BMW exposes its new vehicle at Montreal)), (Mercedes,CompactBuffer(Mercedes introduces beast offroad track)))
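
To check the pairing logic without a Spark cluster, the same flatMap-then-group idea can be sketched on plain Scala collections. This is only an illustration, not part of the Spark answer: the names TagGrouping and groupByTag are made up here, and groupBy on a local Seq stands in for groupByKey on the RDD. Note that contains does a case-sensitive substring match, so a post mentioning several tags appears under each of them and a post mentioning none is dropped.

```scala
object TagGrouping {
  // Hypothetical helper: pair each post with every tag it contains,
  // then group the posts by tag (local analogue of flatMap + groupByKey).
  def groupByTag(posts: Seq[String], tags: Seq[String]): Map[String, Seq[String]] =
    posts
      .flatMap(post => tags.filter(tag => post.contains(tag)).map(tag => (tag, post)))
      .groupBy(_._1)
      .map { case (tag, pairs) => tag -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val posts = Seq(
      "BMW is a good car",
      "AUDI beats Tesla on speed race",
      "BMW exposes its new vehicle at Montreal",
      "Mercedes introduces beast offroad track")
    val tags = Seq("BMW", "AUDI", "Mercedes")

    // Print each tag with the posts that mention it.
    groupByTag(posts, tags).foreach { case (tag, matched) =>
      println(s"$tag -> ${matched.mkString("; ")}")
    }
  }
}
```

If substring matches turn out to be too loose (for example a tag that is contained in a longer word), one option is to split each post into words first and compare whole tokens instead.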