我可以在两个Spark DStream上进行JOIN,例如:
val joinStream = stream1.join(stream2)
现在,如果我需要过滤掉所有未加入的记录,该怎么办?基本上,像stream1.anti-join(stream2)
之类的东西。这有可能吗?
感谢并感谢任何帮助!
答案 0 :(得分:2)
假设你有这些:
val rdd1 = sc.parallelize(Array(
(1, "one"),
(2, "twow"),
(3, "three"),
(4, "four"),
(5, "five")
))
val rdd2 = sc.parallelize(Array(
(1, "otherOne"),
(4, "otherFour"),
(5,"otherFive"),
(6,"six"),
(7,"seven")
))
val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))