I am trying to find the intersection of two RDDs of Strings using Apache Spark's intersection method, but it returns an empty array.
val d = sc.parallelize(Seq("web services as a software", "RCB vs CSK"))
val d1 = sc.parallelize(Seq("software as a services", "CSK vs RCB"))
d.intersection(d1).collect
Output:
res6: Array[String] = Array()
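Answer 0 (score: 1)
intersection compares whole elements, and no complete sentence appears in both RDDs, so the result is empty. You are missing the step that splits the sentences into words: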
val d = sc.parallelize(Seq("web services as a software", "RCB vs CSK")).flatMap(_.split(" "))
val d1 = sc.parallelize(Seq("software as a services", "CSK vs RCB")).flatMap(_.split(" "))
d.intersection(d1).collect
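To see why the split matters without starting a Spark job, the same comparison can be sketched with plain Scala sets. This is only an illustration: the names s, s1, wholeSentences and commonWords are made up here, and it assumes that set intersection is a fair stand-in for RDD.intersection, which also compares elements by equality and returns distinct values.

```scala
// Plain Scala collections standing in for the two RDDs (an assumption for illustration).
val s  = Seq("web services as a software", "RCB vs CSK")
val s1 = Seq("software as a services", "CSK vs RCB")

// Each element is a whole sentence; no sentence occurs in both, so this is empty.
val wholeSentences = s.toSet.intersect(s1.toSet)

// Splitting on spaces first makes individual words the elements being compared.
val words  = s.flatMap(_.split(" ")).toSet
val words1 = s1.flatMap(_.split(" ")).toSet

// Every word except "web" occurs on both sides.
val commonWords = words.intersect(words1)
```

With the split in place, the Spark version should likewise return the shared words (services, as, a, software, RCB, vs, CSK), though collect on an RDD makes no promise about their order.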