How do I map one column of an RDD to other columns of (a) the same RDD and (b) a different RDD?

Asked: 2017-06-29 10:05:01

Tags: scala apache-spark

I am a beginner, using Spark 2.1.1 and Scala 2.11.8.

I have an RDD with six columns. This is its first entry:

(String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")

The actual RDD has more than 5 million entries.

I want to pair the first column with each of the third, fourth, fifth and sixth columns, so that I get something like this:

(fb_406423006398063, p69465323_serv80i)
(guest_861067032060185_android, p69465323_serv80i)
(fb_100000829486587, p69465323_serv80i)
(fb_100007900293502, p69465323_serv80i)

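That is, the first column is paired separately with the third, fourth, fifth and sixth columns. How can I do this (a) within the same RDD and (b) across different RDDs?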

2 answers:

Answer 0 (score: 2)

Given that you have an array of tuples in which each element looks like this:

(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")

You can use the following, which produces one tuple of four (first column, other column) pairs per row:

val rdd = sc.parallelize(Array((" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502")))
val pairedRdd = rdd.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6)) )
pairedRdd.collect
Array[((String, String), (String, String), (String, String), (String, String))] = Array(((" p69465323_serv80i"," fb_406423006398063"),(" p69465323_serv80i"," guest_861067032060185_android"),(" p69465323_serv80i"," fb_100000829486587"),(" p69465323_serv80i"," fb_100007900293502")))
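
Note that the snippet above returns a single tuple of four pairs per row, with the first column on the left, while the question asks for one (other column, first column) pair per record and without the leading spaces. A minimal sketch of that variant, assuming flatMap is acceptable for your case; the names PairColumns, Row6 and toPairs are hypothetical, and the local SparkSession is only there to make the example self-contained (in spark-shell you would reuse the existing sc):

```scala
import org.apache.spark.sql.SparkSession

object PairColumns {
  // Alias for the six-column row shape shown in the question (name is an assumption).
  type Row6 = (String, String, String, String, String, String)

  // Emit one (otherColumn, firstColumn) pair for each of columns 3 to 6,
  // trimming the padding that appears in the sample data.
  def toPairs(row: Row6): Seq[(String, String)] = {
    val key = row._1.trim
    Seq(row._3, row._4, row._5, row._6).map(other => (other.trim, key))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("pair-columns")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq(
      (" p69465323_serv80i", " 7 ", " fb_406423006398063",
        " guest_861067032060185_android", " fb_100000829486587", " fb_100007900293502")
    ))

    // flatMap flattens the four pairs of every row into one RDD[(String, String)].
    val pairs = rdd.flatMap(toPairs)
    pairs.collect().foreach(println)

    spark.stop()
  }
}
```

Because the result is a plain pair RDD, it should also be usable with key-based operations such as join if the values need to be matched against a different RDD later.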

Answer 1 (score: 0)

Declare a YourModelClass object or class with the variables firstCol, secondCol, ... fiftCol.
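
A rough sketch of what such a model class might look like, assuming a case class with six fields to match the six columns in the question. Only YourModelClass, firstCol and secondCol come from the text above; the remaining field names, the pairs method and the ModelDemo helper are hypothetical. The demo runs on a local Seq so it needs no cluster; on an RDD the same calls would be rdd.map(ModelDemo.fromTuple).flatMap(_.pairs).

```scala
// Hypothetical model class: field names after secondCol are assumptions.
final case class YourModelClass(
    firstCol: String,
    secondCol: String,
    thirdCol: String,
    fourthCol: String,
    fifthCol: String,
    sixthCol: String) {

  // One (otherColumn, firstColumn) pair for each of columns 3 to 6, trimmed.
  def pairs: Seq[(String, String)] =
    Seq(thirdCol, fourthCol, fifthCol, sixthCol).map(c => (c.trim, firstCol.trim))
}

object ModelDemo {
  // Convert the raw six-element tuple into the named model.
  def fromTuple(t: (String, String, String, String, String, String)): YourModelClass =
    YourModelClass(t._1, t._2, t._3, t._4, t._5, t._6)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      (" p69465323_serv80i", " 7 ", " fb_406423006398063",
        " guest_861067032060185_android", " fb_100000829486587", " fb_100007900293502")
    )
    // Same shape as the RDD pipeline: map to the model, then flatten the pairs.
    rows.map(fromTuple).flatMap(_.pairs).foreach(println)
  }
}
```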

I hope this helps.