I'm a beginner, using Spark 2.1.1 and Scala 2.11.8.
I have an RDD with six columns. This is the first entry of the RDD:
(String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")
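The actual RDD has more than 5 million entries.
I want to pair the first column with the third, fourth, fifth, and sixth columns individually, so that I get something like this: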
(fb_406423006398063, p69465323_serv80i)
(guest_861067032060185_android, p69465323_serv80i)
(fb_100000829486587, p69465323_serv80i)
(fb_100007900293502, p69465323_serv80i)
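That is, the first column is paired with each of the third, fourth, fifth, and sixth columns in turn. How can I do this (a) in the same RDD and (b) in separate RDDs?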
Answer 0: (score: 2)
Assuming you have an array of tuples in which each element looks like this:
(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")
You can use the following:
val rdd = sc.parallelize(Array((" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")))
val pairedRdd = rdd.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6)) )
pairedRdd.collect
Array[((String, String), (String, String), (String, String), (String, String))] = Array(((" p69465323_serv80i"," fb_406423006398063"),(" p69465323_serv80i"," guest_861067032060185_android"),(" p69465323_serv80i"," fb_100000829486587"),(" p69465323_serv80i"," fb_100007900293502")))
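Note that the result above is one tuple of four pairs per row, keyed as (first column, other column). If one pair per row in the order shown in the question is wanted instead, a `flatMap` is one possible approach. The following is only a minimal sketch: the names `PairColumns`, `Row6`, and `toPairs` are made up for illustration, and the call to `trim` assumes the leading and trailing spaces in the sample data are unwanted. It runs on a plain Scala `Seq` so it can be tried without a cluster.

```scala
object PairColumns {
  // Hypothetical alias for one six-column row, matching the tuple type in the question.
  type Row6 = (String, String, String, String, String, String)

  // Turn one row into four (otherColumn, firstColumn) pairs.
  // Assumption: surrounding whitespace in the values is not significant, so it is trimmed.
  def toPairs(row: Row6): Seq[(String, String)] = {
    val key = row._1.trim
    Seq(row._3, row._4, row._5, row._6).map(value => (value.trim, key))
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row6] = Seq(
      (" p69465323_serv80i", " 7 ", " fb_406423006398063",
        " guest_861067032060185_android", " fb_100000829486587", " fb_100007900293502")
    )
    // flatMap flattens the four pairs of every row into one collection of pairs.
    rows.flatMap(toPairs).foreach(println)
  }
}
```

In `spark-shell`, with `rdd` defined as above, `rdd.flatMap(PairColumns.toPairs)` should give a single `RDD[(String, String)]`, which covers case (a). For case (b), four separate RDDs can be built with one `map` each, for example `rdd.map(r => (r._3.trim, r._1.trim))`, and likewise for `_4`, `_5`, and `_6`. With several million rows it may be worth calling `rdd.cache()` before deriving the four RDDs so the source is not recomputed each time.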
Answer 1: (score: 0)
${delCount}= Set Variable 0
:FOR ${loopIndex} IN RANGE 0 8
\ Log ${loopIndex}
\ ${delCount}= Run Keyword If ${loopIndex} == 3 Evaluate ${loopIndex} + ${delCount}
\ ... ELSE IF ${loopIndex} == 6 Evaluate ${delCount} + 6
\ ... ELSE Sleep 1s
Log ${delCount}
Declare a YourModelClass object or class with the variables firstCol, secondCol, ... fifthCol.
I hope this helps.