Intersection not working in Apache Spark

Time: 2016-05-02 16:20:08

Tags: java scala apache-spark

I am trying to find the intersection of two RDDs of strings using Apache Spark's intersection method, but it returns an empty array.

val d=sc.parallelize(Seq("web services as a software","RCB vs CSK"))

val d1 = sc.parallelize(Seq("software as a services", "CSK vs RCB"))

d.intersection(d1).collect
  

Output:

     

res6: Array[String] = Array()

1 answer:

Answer 0 (score: 1)

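intersection compares whole elements for equality, and no complete sentence appears in both RDDs, so the result is empty. You are missing the step that splits the sentences into words: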

val d=sc.parallelize(Seq("web services as a software","RCB vs CSK")).flatMap(_.split(" "))

val d1 = sc.parallelize(Seq("software as a services", "CSK vs RCB")).flatMap(_.split(" "))

d.intersection(d1).collect
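
To see the difference without a Spark cluster, here is a minimal sketch that uses plain Scala collections in place of RDDs. It is only an analogy: the object name IntersectionSketch and the values a, b, sentenceOverlap and wordOverlap are made up for illustration, and Set intersection stands in for RDD.intersection, which likewise returns distinct elements that are equal on both sides.

```scala
object IntersectionSketch {
  // Same input strings as in the question (hypothetical names a and b).
  val a = Seq("web services as a software", "RCB vs CSK")
  val b = Seq("software as a services", "CSK vs RCB")

  // Whole sentences are compared: no element of a equals any element of b.
  val sentenceOverlap: Set[String] = a.toSet intersect b.toSet

  // Split each sentence on single spaces first, then compare word by word.
  val wordOverlap: Set[String] =
    a.flatMap(_.split(" ")).toSet intersect b.flatMap(_.split(" ")).toSet

  def main(args: Array[String]): Unit = {
    println(s"sentence overlap is empty: ${sentenceOverlap.isEmpty}")
    println(s"shared words: ${wordOverlap.toSeq.sorted.mkString(", ")}")
  }
}
```

Every word except "web" occurs on both sides, so the word-level result is non-empty. In Spark the order of the collected elements is not guaranteed, so sort the array after collect if you need stable output.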