Scala: filtering an RDD

Time: 2019-06-04 09:04:20

Tags: scala apache-spark rdd

I have these vals:

val key: RDD[String] = sc.parallelize(Seq("0000005", "0000001", "0000007")) // sc is assumed to be an existing SparkContext

val file2: Array[(String, Int, Int, Int, Int, Int)] = Array(("0000005", 82, 79, 16, 21, 80),
("0000001", 46, 39, 8, 5, 21),
("0000004", 58, 71, 20, 10, 6),
("0000009", 60, 89, 33, 18, 6),
("0000003", 30, 50, 71, 36, 30),
("0000007", 50, 2, 33, 15, 62))

I want to keep only the elements of file2 whose first field is present in key.

I want something like this:

0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000007 50 2 33 15 62

2 answers:

Answer 0: (score: 0)

I simplified this to standard Scala collection types. Filtering the following data by key gives your result:

val keys = Seq("0000005","0000001","0000007")

val all = Seq("0000005 82 79 16 21 80",
"0000001 46 39 8 5 21", 
"0000004 58 71 20 10 6",
"0000009 60 89 33 18 6",
"0000003 30 50 71 36 30",
"0000007 50 2 33 15 62")
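The filter function itself does not appear in the answer as quoted here. A minimal sketch of what such a filter might look like, assuming the key is the first space-separated token of each row; the names `FilterByKeys` and `filterByKeys` are hypothetical and not taken from the original answer:

```scala
object FilterByKeys {
  val keys: Seq[String] = Seq("0000005", "0000001", "0000007")

  val all: Seq[String] = Seq(
    "0000005 82 79 16 21 80",
    "0000001 46 39 8 5 21",
    "0000004 58 71 20 10 6",
    "0000009 60 89 33 18 6",
    "0000003 30 50 71 36 30",
    "0000007 50 2 33 15 62")

  // Hypothetical helper: keeps the rows whose first token is in `wanted`.
  // A Set is used so that each membership check is a hash lookup.
  def filterByKeys(rows: Seq[String], wanted: Set[String]): Seq[String] =
    rows.filter(row => wanted.contains(row.takeWhile(_ != ' ')))

  def main(args: Array[String]): Unit =
    filterByKeys(all, keys.toSet).foreach(println)
}
```

Because `filter` preserves the order of `all`, the three matching rows come out in the same order as in the question.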

See Scalafiddle.

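Answer 1: (score: 0)

First, file2 needs to be mapped into a key -> value structure (I assume each row of file2 is an Array[String], i.e. that all the numbers in file2 are actually strings):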

val file2Map: RDD[(String, Array[String])] = sc.parallelize(file2.map(value => (value.head, value))) // sc is assumed to be an existing SparkContext

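Now, since join is only defined on RDDs of (key, value) pairs, map the keys to pairs first and then join:     keys.map(k => (k, k)).join(file2Map).take(10).foreach(println)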

The output looks something like this (the row order may vary):

(0000005, (0000005, 0000005 82 79 16 21 80))
(0000001, (0000001, 0000001 46 39 8 5 21))
(0000007, (0000007, 0000007 50 2 33 15 62))

From there it is easy to keep only the second element of each value tuple.
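The join and the final projection can be sketched end to end as follows. This is only an illustration under stated assumptions: it uses a local `SparkContext`, treats every row as an `Array[String]`, and the names `JoinByKeys` and `keepMatching` are hypothetical rather than part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinByKeys {
  // Hypothetical helper: keeps the rows whose first field appears in `keys`.
  def keepMatching(keys: RDD[String], rows: RDD[Array[String]]): RDD[String] = {
    val keyed: RDD[(String, Array[String])] = rows.map(row => (row.head, row))
    keys.map(k => (k, ()))                 // join needs (key, value) pairs on both sides
      .join(keyed)                         // yields (key, ((), row))
      .map { case (_, (_, row)) => row.mkString(" ") } // keep only the row itself
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption for the sake of a runnable example.
    val sc = new SparkContext(
      new SparkConf().setAppName("join-by-keys").setMaster("local[*]"))
    try {
      val keys = sc.parallelize(Seq("0000005", "0000001", "0000007"))
      val rows = sc.parallelize(Seq(
        Array("0000005", "82", "79", "16", "21", "80"),
        Array("0000001", "46", "39", "8", "5", "21"),
        Array("0000004", "58", "71", "20", "10", "6"),
        Array("0000009", "60", "89", "33", "18", "6"),
        Array("0000003", "30", "50", "71", "36", "30"),
        Array("0000007", "50", "2", "33", "15", "62")))
      // join does not guarantee ordering, so sort before printing.
      keepMatching(keys, rows).collect().sorted.foreach(println)
    } finally sc.stop()
  }
}
```

When the key list is small, as it is here, collecting the keys into a `Set`, broadcasting it, and calling `filter` on the rows is an alternative that avoids the shuffle a join requires.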