Subtract rows from an RDD based on the values of a second RDD in Spark

Time: 2015-11-17 10:16:30

Tags: apache-spark rdd

I have two RDDs named releventResults and ranoms1.

releventResults contains the following data:

2:DestIP:173.194.116.42,1:SrIP:172.20.16.121,3:DestPort:80,=>4:Time_Range:11:00-12:00 = 1.0
2:DestIP:172.20.16.4,1:SrIP:172.20.16.51,3:DestPort:80,=>4:Time_Range:16:00-17:00 = 0.13
2:DestIP:216.92.251.5,4:Time_Range:10:00-11:00,3:DestPort:80,=>1:SrIP:172.20.16.64 = 1.0
2:DestIP:172.20.16.9,1:SrIP:172.20.16.82,3:DestPort:80,=>4:Time_Range:17:00-18:00 = 0.13
2:DestIP:190.93.247.58,1:SrIP:172.20.16.102,4:Time_Range:12:00-13:00,=>3:DestPort:80 = 1.0
2:DestIP:140.98.193.112,1:SrIP:172.20.16.110,3:DestPort:80,=>4:Time_Range:15:00-16:00 = 0.9
2:DestIP:91.189.92.201,1:SrIP:172.20.16.58,3:DestPort:80,=>4:Time_Range:11:00-12:00 = 1.0
1:SrIP:172.20.16.121,4:Time_Range:09:00-10:00,3:DestPort:80,=>2:DestIP:199.27.79.196 = 0.03
1:SrIP:172.20.16.111,4:Time_Range:10:00-11:00,3:DestPort:80,=>2:DestIP:185.31.19.196 = 0.01
2:DestIP:88.221.48.112,1:SrIP:172.20.16.107,4:Time_Range:16:00-17:00,=>3:DestPort:80 = 1.0
1:SrIP:172.20.16.60,2:DestIP:91.189.92.152,3:DestPort:80,=>4:Time_Range:07:00-8:00 = 1.0
4:Time_Range:14:00-15:00,1:SrIP:172.20.16.51,3:DestPort:80,=>2:DestIP:172.20.16.7 = 0.15
2:DestIP:172.20.16.10,1:SrIP:172.20.16.82,4:Time_Range:11:00-12:00,=>3:DestPort:3910 = 1.0
2:DestIP:198.252.206.16,4:Time_Range:12:00-13:00,1:SrIP:172.20.16.106,=>3:DestPort:80 = 1.0
2:DestIP:23.235.43.130,4:Time_Range:13:00-14:00,3:DestPort:80,=>1:SrIP:172.20.16.106 = 1.0
1:SrIP:172.20.16.76,2:DestIP:172.20.16.64,4:Time_Range:17:00-18:00,=>3:DestPort:2869 = 1.0

and ranoms1 contains:

1:SrIP:172.20.16.103 2:DestIP:54.225.129.170 3:DestPort:80 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.89 2:DestIP:172.20.16.83 3:DestPort:5357 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.105 2:DestIP:110.93.194.234 3:DestPort:80 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.84 2:DestIP:172.20.16.64 3:DestPort:2869 4:Time_Range:12:00-13:00
1:SrIP:172.20.16.96 2:DestIP:82.178.158.26 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.105 2:DestIP:82.163.79.170 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.115 2:DestIP:92.122.48.122 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.105 2:DestIP:46.102.243.70 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.51 2:DestIP:216.34.181.59 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.31 2:DestIP:95.101.72.17 3:DestPort:80 4:Time_Range:10:00-11:00
1:SrIP:172.20.16.51 2:DestIP:54.75.236.43 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.103 2:DestIP:68.232.34.200 3:DestPort:80 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.89 2:DestIP:172.20.16.34 3:DestPort:5357 4:Time_Range:11:00-12:00
1:SrIP:172.20.16.124 2:DestIP:107.20.214.255 3:DestPort:80 4:Time_Range:11:00-12:00

I have the following code:

 var finalRanoms = ranoms1
   .filter(_.split.map(p=> p(0)+" "+p(1)+" "+p(2)+" "+p(3)
     (releventResults.map(x=>x.contains(p(1))))))

I want to filter ranoms1, removing the rows whose second element (DestIP) appears in releventResults.

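1 answer:

Answer 0: (score: 0)

filter on an RDD keeps the elements of that RDD that satisfy a given predicate. The predicate sees one element at a time, and Spark does not support using one RDD inside another RDD's transformation, so filter alone cannot compare rows against a second RDD. You can use the RDD.intersection API to get a result RDD containing the elements common to both RDDs. You can use the RDD.subtract / RDD.subtractByKey API to get a result RDD containing the elements of the set difference "A - B".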

val duplicates = rdd1.intersection(rdd2)  
val nonDuplicates = rdd1.subtract(rdd2)
val nonDuplicatesByKey = rdd1.subtractByKey(rdd2)
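
To make the difference between the three calls concrete, here is a minimal sketch on toy data. It assumes spark-core is on the classpath and runs against a local master; the object name `SetOpsSketch`, the app name, and the sample values are illustrative and not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOpsSketch {
  // Returns the collected, sorted results of intersection, subtract and subtractByKey.
  def run(sc: SparkContext): (Seq[Int], Seq[Int], Seq[(String, Int)]) = {
    val a = sc.parallelize(Seq(1, 2, 3, 4))
    val b = sc.parallelize(Seq(3, 4, 5))

    // Elements present in both a and b.
    val common = a.intersection(b).collect().sorted.toList
    // Elements of a that do not occur in b.
    val onlyInA = a.subtract(b).collect().sorted.toList

    val pairsA = sc.parallelize(Seq(("x", 1), ("y", 2), ("z", 3)))
    val pairsB = sc.parallelize(Seq(("y", "ignored")))
    // Pairs of pairsA whose key does not occur in pairsB.
    // Only the keys of pairsB matter; its values are never compared.
    val byKey = pairsA.subtractByKey(pairsB).collect().sortBy(_._1).toList

    (common, onlyInA, byKey)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[*]").setAppName("set-ops-sketch")
    val sc = new SparkContext(conf)
    try {
      val (common, onlyInA, byKey) = run(sc)
      println(common.mkString(","))  // 3,4
      println(onlyInA.mkString(",")) // 1,2
      println(byKey.mkString(","))   // (x,1),(z,3)
    } finally {
      sc.stop()
    }
  }
}
```

Note that subtract compares whole elements, while subtractByKey compares only the keys, which is why the value types of the two pair RDDs may differ.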

To filter rdd2 by the IPs present in rdd1, I would convert both into key-value RDDs (with the IP as the key) and then subtract by key:

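// Key each line by its IP so the two RDDs can be compared by key.
val rdd1Pairs = rdd1.map(x => (getIpKeyFromRdd1(x), x))
val rdd2Pairs = rdd2.map(x => (getIpKeyFromRdd2(x), x))
// Keep only the rdd2 entries whose key does not occur in rdd1.
val nonDuplicatesByKey = rdd2Pairs.subtractByKey(rdd1Pairs)
// Drop the keys and keep the original lines.
val rdd2Filtered = nonDuplicatesByKey.values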

You have to implement getIpKeyFromRdd1 and getIpKeyFromRdd2 yourself, one for each line format.
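
One possible shape for those key functions, as a sketch only: in the sample data both RDDs carry the destination address as a `2:DestIP:<address>` field, so a single regex-based extractor could serve for both. The names `IpKeySketch` and `destIpKey` are hypothetical, and reading rdd1 as releventResults and rdd2 as ranoms1 is an assumption based on the question.

```scala
object IpKeySketch {
  // Matches the "2:DestIP:<ipv4>" field that appears in both sample formats.
  private val DestIp = """2:DestIP:(\d{1,3}(?:\.\d{1,3}){3})""".r

  // Hypothetical stand-in for getIpKeyFromRdd1 / getIpKeyFromRdd2.
  // Returns None when a line has no DestIP field.
  def destIpKey(line: String): Option[String] =
    DestIp.findFirstMatchIn(line).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // One line from each sample data set in the question.
    val rule =
      "1:SrIP:172.20.16.76,2:DestIP:172.20.16.64,4:Time_Range:17:00-18:00,=>3:DestPort:2869 = 1.0"
    val record =
      "1:SrIP:172.20.16.84 2:DestIP:172.20.16.64 3:DestPort:2869 4:Time_Range:12:00-13:00"

    println(destIpKey(rule))   // Some(172.20.16.64)
    println(destIpKey(record)) // Some(172.20.16.64)
  }
}
```

Because the extractor returns an Option, the keyed RDDs could be built with `flatMap` instead of `map`, for example `rdd2.flatMap(x => IpKeySketch.destIpKey(x).map(k => (k, x)))`, which silently drops lines that have no DestIP field. Whether dropping such lines is acceptable depends on the data.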