Spark: anti-join two DStreams

Time: 2016-04-15 12:14:04

Tags: scala apache-spark apache-spark-sql spark-streaming

I can JOIN two Spark DStreams, for example:

val joinStream = stream1.join(stream2)

Now, what if I need to filter for all the records that were not joined? Basically, something like stream1.anti-join(stream2). Is this possible?

Thanks, any help is appreciated!

1 Answer:

Answer 0: (score: 2)

Suppose you have these two RDDs:

val rdd1 = sc.parallelize(Array(
  (1, "one"),
  (2, "twow"),
  (3, "three"),
  (4, "four"),
  (5, "five")
))
val rdd2 = sc.parallelize(Array(
  (1, "otherOne"),
  (4, "otherFour"),
  (5,"otherFive"),
  (6,"six"),
  (7,"seven")
))

val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)

antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
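
The answer above works on RDDs, while the question asks about DStreams. As a minimal sketch, the same idea can be applied per batch to two pair DStreams. The helper names `symmetricAntiJoin` and `leftAntiJoin` are hypothetical, the key and value types are fixed to `(Int, String)` only to match the sample data, and the sketch assumes a Spark version (1.3 or later) where the pair-DStream and pair-RDD implicits are picked up automatically.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Symmetric anti-join: keys that appear in exactly one of the two streams
// within the same batch. Mirrors the fullOuterJoin + filter approach above.
def symmetricAntiJoin(
    stream1: DStream[(Int, String)],
    stream2: DStream[(Int, String)]
): DStream[(Int, (Option[String], Option[String]))] =
  stream1.fullOuterJoin(stream2).filter { case (_, (left, right)) =>
    left.isEmpty || right.isEmpty
  }

// Left anti-join: records of stream1 whose key has no match in stream2
// within the same batch. transformWith exposes the two underlying RDDs of
// each batch, so the RDD operation subtractByKey can be used directly.
def leftAntiJoin(
    stream1: DStream[(Int, String)],
    stream2: DStream[(Int, String)]
): DStream[(Int, String)] =
  stream1.transformWith(
    stream2,
    (left: RDD[(Int, String)], right: RDD[(Int, String)]) =>
      left.subtractByKey(right)
  )
```

Both helpers only compare records that fall into the same batch interval; a key that arrives in one stream now and in the other stream a batch later would still be reported as unmatched. On the sample RDDs above, `rdd1.subtractByKey(rdd2)` would keep only the records with keys 2 and 3, which is the one-sided result that `stream1.anti-join(stream2)` in the question seems to describe.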