Combining two Spark DataFrames based on a column

Time: 2019-07-15 17:16:11

Tags: dataframe pyspark

I have two DataFrames:

df1 = 

    | city    | timestamp           | value |
     ---------------------------------
    |  a      | 2019-01-01 00:00:00 |  1    | 
    |  a      | 2018-01-01 00:00:00 |  2    |
    |  b      | 2018-01-01 10:00:00 |  1    | 
    |  b      | 2018-01-01 20:00:00 |  3    |
    |  c      | 2019-01-01 10:00:00 |  2    |
    |  a      | 2018-01-01 20:00:00 |  5    |
    |  c      | 2018-01-01 10:00:00 |  7    |
    |  b      | 2017-01-01 20:00:00 |  10   |


df2 = 

    | city    | timestamp           | value | ref_timestamp
     ---------------------------------
    |  a      | 2019-01-01 00:00:00 |  1    | 2018-01-01 00:00:00
    |  a      | 2019-01-01 20:00:00 |  2    | 2018-01-01 20:00:00
    |  b      | 2019-01-01 10:00:00 |  1    | 2018-01-01 10:00:00
    |  b      | 2018-01-01 20:00:00 |  3    | 2017-01-01 20:00:00
    |  c      | 2019-01-01 10:00:00 |  2    | 2018-01-01 10:00:00

I need to join these two DataFrames to get the following DataFrame:

df3 = 

    | city    | timestamp           | value | ref_timestamp        | ref_value
     ---------------------------------
    |  a      | 2019-01-01 00:00:00 |  1    | 2018-01-01 00:00:00  | 2
    |  a      | 2019-01-01 20:00:00 |  2    | 2018-01-01 20:00:00  | 5
    |  b      | 2019-01-01 10:00:00 |  1    | 2018-01-01 10:00:00  | 1
    |  b      | 2018-01-01 20:00:00 |  3    | 2017-01-01 20:00:00  | 10
    |  c      | 2019-01-01 10:00:00 |  2    | 2018-01-01 10:00:00  | 7

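Basically, for each row of df2 it takes ref_timestamp, looks it up in the timestamp column of df1 (for the same city), and fetches the corresponding value as ref_value.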
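To make the intended lookup concrete, here is a minimal plain-Python sketch of the semantics (not Spark code). It assumes the lookup key is the pair (city, timestamp), which is inferred from the expected df3 above: 2018-01-01 10:00:00 maps to 1 for city b but to 7 for city c. The names df1_rows, df2_rows, value_at and df3_rows are illustrative only.

```python
# Rows of df1 and df2 as plain tuples (timestamps kept as strings for brevity)
df1_rows = [
    ("a", "2019-01-01 00:00:00", 1),
    ("a", "2018-01-01 00:00:00", 2),
    ("b", "2018-01-01 10:00:00", 1),
    ("b", "2018-01-01 20:00:00", 3),
    ("c", "2019-01-01 10:00:00", 2),
    ("a", "2018-01-01 20:00:00", 5),
    ("c", "2018-01-01 10:00:00", 7),
    ("b", "2017-01-01 20:00:00", 10),
]
df2_rows = [
    ("a", "2019-01-01 00:00:00", 1, "2018-01-01 00:00:00"),
    ("a", "2019-01-01 20:00:00", 2, "2018-01-01 20:00:00"),
    ("b", "2019-01-01 10:00:00", 1, "2018-01-01 10:00:00"),
    ("b", "2018-01-01 20:00:00", 3, "2017-01-01 20:00:00"),
    ("c", "2019-01-01 10:00:00", 2, "2018-01-01 10:00:00"),
]

# Index df1 by (city, timestamp) so each reference can be looked up directly
value_at = {(city, ts): value for city, ts, value in df1_rows}

# For each df2 row, fetch df1's value at (city, ref_timestamp); None if absent
df3_rows = [
    (city, ts, value, ref_ts, value_at.get((city, ref_ts)))
    for city, ts, value, ref_ts in df2_rows
]

for row in df3_rows:
    print(row)
```

In Spark terms this is an equi-join of df2 against df1 on city and on df2.ref_timestamp = df1.timestamp.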

3 Answers:

Answer 0: (score: 0)

df1 = df1.withColumnRenamed("value", "ref_value").withColumnRenamed("timestamp", "ref_timestamp")

df3 = df1.join(df2, ["city", "ref_timestamp"], "leftouter").where("timestamp is not null and value is not null")

Result of df3.show():

+----+-------------------+---------+-------------------+-----+
|city|      ref_timestamp|ref_value|          timestamp|value|
+----+-------------------+---------+-------------------+-----+
|   a|2018-01-01 00:00:00|        2|2019-01-01 00:00:00|    1|
|   a|2018-01-01 20:00:00|        5|2019-01-01 20:00:00|    2|
|   b|2018-01-01 10:00:00|        1|2019-01-01 10:00:00|    1|
|   b|2017-01-01 20:00:00|       10|2018-01-01 20:00:00|    3|
|   c|2018-01-01 10:00:00|        7|2019-01-01 10:00:00|    2|
+----+-------------------+---------+-------------------+-----+

Answer 1: (score: 0)

Joining on city and timestamp should do it:

df3 = df1.join(df2, (df1['city'] == df2['city']) & (df1['timestamp'] == df2['ref_timestamp']))

Then you just need to rename/drop columns to get the names you want.

Answer 2: (score: 0)

The following worked for me, and it avoids duplicate columns in the result.

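# Rename df1's columns so they line up with df2's reference columns
df1 = df1.withColumnRenamed('timestamp', 'ref_timestamp').withColumnRenamed('value', 'ref_value')
# df2 already has a ref_timestamp column, so it needs no renaming
# Joining on a list of column names keeps a single copy of each key column
df3 = df2.join(df1, ['city', 'ref_timestamp'])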