I have these vals:
val key: RDD[String] = sc.parallelize(Seq("0000005", "0000001", "0000007"))
and
val file2: Array[(String, Int, Int, Int, Int, Int)] = Array(("0000005", 82, 79, 16, 21, 80), ("0000001", 46, 39, 8, 5, 21), ("0000004", 58, 71, 20, 10, 6), ("0000009", 60, 89, 33, 18, 6), ("0000003", 30, 50, 71, 36, 30), ("0000007", 50, 2, 33, 15, 62))
I want to filter file2 down to the elements whose ID is present in key.
I want something like this:
0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000007 50 2 33 15 62
Answer 0 (score: 0)
I reduced it to standard Scala collection types.
The following filter function gives your result:
val keys = Seq("0000005","0000001","0000007")
val all = Seq("0000005 82 79 16 21 80",
"0000001 46 39 8 5 21",
"0000004 58 71 20 10 6",
"0000009 60 89 33 18 6",
"0000003 30 50 71 36 30",
"0000007 50 2 33 15 62")
See Scalafiddle
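A minimal sketch of what such a filter could look like on plain collections, assuming every line starts with its key followed by a space. The names KeyFilterSketch and filterByKey are made up for illustration and are not taken from the fiddle:

```scala
object KeyFilterSketch {
  // Hypothetical helper: keeps the lines whose leading token is one of the keys.
  def filterByKey(keys: Seq[String], lines: Seq[String]): Seq[String] = {
    // A Set makes each membership check cheap.
    val keySet = keys.toSet
    // Assumption: the key is everything before the first space on the line.
    lines.filter(line => keySet.contains(line.takeWhile(_ != ' ')))
  }

  val keys: Seq[String] = Seq("0000005", "0000001", "0000007")
  val all: Seq[String] = Seq(
    "0000005 82 79 16 21 80",
    "0000001 46 39 8 5 21",
    "0000004 58 71 20 10 6",
    "0000009 60 89 33 18 6",
    "0000003 30 50 71 36 30",
    "0000007 50 2 33 15 62"
  )

  def main(args: Array[String]): Unit =
    filterByKey(keys, all).foreach(println)
}
```

filter preserves the order of the input lines, so the result follows the order of all rather than the order of keys.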
Answer 1 (score: 0)
First, file2 needs to be mapped into a key -> value structure (I am assuming that all the numbers in file2 are actually strings):
val file2Map: RDD[(String, Array[String])] = sc.parallelize(file2.map(value => (value.head, value)))
Now, since join only works on pair RDDs, key each key by itself and then do: keys.map(k => (k, k)).join(file2Map).take(10).foreach(println)
The output is similar to:
(0000005, (0000005, 0000005 82 79 16 21 80))
(0000001, (0000001, 0000001 46 39 8 5 21))
(0000007, (0000007, 0000007 50 2 33 15 62))
From there it is easy to take only the second element of the tuple from the values.
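The whole join approach, including that last step, might be sketched as follows. This assumes Spark is on the classpath, that each row of file2 is an Array[String] whose first element is the key, and a local master purely for illustration. The names JoinFilterSketch and joinFilter are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object JoinFilterSketch {
  // Hypothetical helper: keeps the rows whose first element appears in keys.
  def joinFilter(sc: SparkContext, keys: Seq[String], rows: Seq[Array[String]]): RDD[Array[String]] = {
    // join is only defined on pair RDDs, so key each key by itself.
    val keyPairs: RDD[(String, String)] = sc.parallelize(keys).map(k => (k, k))
    // Assumption: the first element of every row is its key.
    val file2Map: RDD[(String, Array[String])] = sc.parallelize(rows).map(row => (row.head, row))
    // Each joined record is (key, (key, row)); keep only the row.
    keyPairs.join(file2Map).map { case (_, (_, row)) => row }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder().appName("join-filter-sketch").master("local[*]").getOrCreate()
    val keys = Seq("0000005", "0000001", "0000007")
    val file2 = Seq(
      Array("0000005", "82", "79", "16", "21", "80"),
      Array("0000001", "46", "39", "8", "5", "21"),
      Array("0000004", "58", "71", "20", "10", "6"),
      Array("0000009", "60", "89", "33", "18", "6"),
      Array("0000003", "30", "50", "71", "36", "30"),
      Array("0000007", "50", "2", "33", "15", "62")
    )
    // Join output order is not guaranteed, so the rows may print in any order.
    joinFilter(spark.sparkContext, keys, file2).collect().foreach(row => println(row.mkString(" ")))
    spark.stop()
  }
}
```

When the key list is small, collecting it into a Set and filtering file2 directly avoids the shuffle that join performs; the join form is mainly worthwhile when both sides are large.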