Spark - iterate over every row of a DataFrame, comparing multiple columns of each row against another row

Date: 2017-07-31 05:39:00

Tags: scala apache-spark spark-dataframe

| match_id | player_id | team | win |
|----------|-----------|------|-----|
|    0     |      1    |   A  |  A  |
|    0     |      2    |   A  |  A  |
|    0     |      3    |   B  |  A  |
|    0     |      4    |   B  |  A  |
|    1     |      1    |   A  |  B  |
|    1     |      4    |   A  |  B  |
|    1     |      8    |   B  |  B  |
|    1     |      9    |   B  |  B  |
|    2     |      8    |   A  |  A  |
|    2     |      4    |   A  |  A  |
|    2     |      3    |   B  |  A  |
|    2     |      2    |   B  |  A  |

My DataFrame is shown above.

I need to create (key, value) pairs such that for every pair of players in a match:

(k => (player_id_1, player_id_2), v => 1), if player_id_1 beat player_id_2 in the match

(k => (player_id_1, player_id_2), v => 0), if player_id_1 lost to player_id_2 in the match

So I have to traverse the entire DataFrame, comparing each player_id against every other player_id based on the other three columns.

I plan to approach this as follows:

  1. Group by match_id

  2. Within each group, check each player_id against every other player_id as follows:

     a. If the match_id is the same and the teams are different, then

         if team = win
           (k => (player_id_1, player_id_2), v => 1)
         else
           (k => (player_id_1, player_id_2), v => 0)
    
  3. For example, after partitioning by match, consider match_id 0. player_id 1 needs to be compared with player_ids 2, 3, and 4. While iterating, the record for player_id 2 is skipped because the team is the same. For player_id 3 the teams differ, so the team is compared against win. Since player_id 1 is on team A, player_id 3 is on team B, and team A won, the key-value pair formed is

    ((1,3),1)
    

I have a fairly good idea of how to do this in imperative programming, but I am very new to Scala and functional programming, and I cannot figure out how to iterate over each row of the DataFrame and create a (key, value) pair by checking the other fields.

I have tried my best to explain the problem. Please let me know if any part of my question is unclear; I will be happy to clarify. Thanks.

P.S.: I am using Spark 1.6
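The grouped, pairwise plan above can be sketched without Spark at all, using plain Scala collections; `Row` here is a hypothetical case class that mirrors the DataFrame's columns:

```scala
// Spark-free sketch of the grouped pairwise comparison: group rows by
// match_id, pair up players on different teams, and flag 1 when the left
// player's team won the match.
case class Row(matchId: Int, playerId: Int, team: String, win: String)

object PairwiseSketch {
  val rows = Seq(
    Row(0, 1, "A", "A"), Row(0, 2, "A", "A"), Row(0, 3, "B", "A"), Row(0, 4, "B", "A"),
    Row(1, 1, "A", "B"), Row(1, 4, "A", "B"), Row(1, 8, "B", "B"), Row(1, 9, "B", "B"),
    Row(2, 8, "A", "A"), Row(2, 4, "A", "A"), Row(2, 3, "B", "A"), Row(2, 2, "B", "A")
  )

  val result: Map[(Int, Int), Int] =
    rows.groupBy(_.matchId).values.flatMap { group =>
      for {
        left  <- group
        right <- group
        if left.team != right.team   // skip same-team pairs
      } yield (left.playerId, right.playerId) -> (if (left.team == left.win) 1 else 0)
    }.toMap

  def main(args: Array[String]): Unit = {
    println(result((1, 3)))  // prints 1: player 1 (team A) beat player 3 in match 0
  }
}
```

The same nested-loop shape is what the accepted answer expresses as a self-join in the DataFrame API.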

1 Answer:

Answer 0: (score: 3)

This can be achieved with the DataFrame API, as shown below.

DataFrame API version

val df = Seq(
  (0,1,"A","A"), (0,2,"A","A"), (0,3,"B","A"), (0,4,"B","A"),
  (1,1,"A","B"), (1,4,"A","B"), (1,8,"B","B"), (1,9,"B","B"),
  (2,8,"A","A"), (2,4,"A","A"), (2,3,"B","A"), (2,2,"B","A")
).toDF("match_id", "player_id", "team", "win")

val result = df.alias("left")
       .join(df.alias("right"), $"left.match_id" === $"right.match_id" && not($"right.team" === $"left.team"))
       .select($"left.player_id", $"right.player_id", when($"left.team" === $"left.win", 1).otherwise(0).alias("flag"))

scala> result.collect().map(x => (x.getInt(0),x.getInt(1)) -> x.getInt(2)).toMap
res4: scala.collection.immutable.Map[(Int, Int),Int] = Map((1,8) -> 0, (3,4) -> 0, (3,1) -> 0, (9,1) -> 1, (4,1) -> 0, (8,1) -> 1, (2,8) -> 0, (8,3) -> 1, (1,9) -> 0, (1,4) -> 1, (8,2) -> 1, (4,9) -> 0, (3,2) -> 0, (1,3) -> 1, (4,8) -> 0, (4,2) -> 1, (2,4) -> 1, (8,4) -> 1, (2,3) -> 1, (4,3) -> 1, (9,4) -> 1, (3,8) -> 0)
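One caveat (my observation, not part of the original answer): a pair of players can meet in more than one match with different outcomes, e.g. players 2 and 4 face each other in match 0 and again in match 2. Collapsing with `.toMap` then keeps only one flag per ordered key, and which flag survives depends on row order. A minimal plain-Scala illustration with hypothetical literals taken from those two matches:

```scala
object PairCollision {
  // The same ordered pair can receive different flags in different matches;
  // Seq.toMap silently keeps only the last occurrence of a duplicate key.
  val pairs = Seq(
    ((2, 4), 1),  // match 0: player 2 (team A) beat player 4 (team B)
    ((2, 4), 0)   // match 2: player 2 (team B) lost to player 4 (team A)
  )
  val collapsed: Map[(Int, Int), Int] = pairs.toMap  // ((2,4) -> 0) here
}
```

In the collected output above, (2,4) -> 1 happens to survive; with a different row order it could equally have been 0.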

Spark SQL version

df.registerTempTable("data_table")

val result = sqlContext.sql("""
SELECT DISTINCT t0.player_id AS player_id_1, t1.player_id AS player_id_2,
       CASE WHEN t0.team = t0.win THEN 1 ELSE 0 END AS flag
FROM data_table t0
INNER JOIN data_table t1
  ON t0.match_id = t1.match_id
 AND t0.team != t1.team
""")