I have two DataFrames:
df1 =
| city | timestamp | value |
---------------------------------
| a | 2019-01-01 00:00:00 | 1 |
| a | 2018-01-01 00:00:00 | 2 |
| b | 2018-01-01 10:00:00 | 1 |
| b | 2018-01-01 20:00:00 | 3 |
| c | 2019-01-01 10:00:00 | 2 |
| a | 2018-01-01 20:00:00 | 5 |
| c | 2018-01-01 10:00:00 | 7 |
| b | 2017-01-01 20:00:00 | 10 |
df2 =
| city | timestamp | value | ref_timestamp
---------------------------------
| a | 2019-01-01 00:00:00 | 1 | 2018-01-01 00:00:00
| a | 2019-01-01 20:00:00 | 2 | 2018-01-01 20:00:00
| b | 2019-01-01 10:00:00 | 1 | 2018-01-01 10:00:00
| b | 2018-01-01 20:00:00 | 3 | 2017-01-01 20:00:00
| c | 2019-01-01 10:00:00 | 2 | 2018-01-01 10:00:00
I need to join these two DataFrames to get the following DataFrame:
df3 =
| city | timestamp | value | ref_timestamp | ref_value
---------------------------------
| a | 2019-01-01 00:00:00 | 1 | 2018-01-01 00:00:00 | 2
| a | 2019-01-01 20:00:00 | 2 | 2018-01-01 20:00:00 | 5
| b | 2019-01-01 10:00:00 | 1 | 2018-01-01 10:00:00 | 1
| b | 2018-01-01 20:00:00 | 3 | 2017-01-01 20:00:00 | 10
| c | 2019-01-01 10:00:00 | 2 | 2018-01-01 10:00:00 | 7
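Basically, for each row of df2 it takes ref_timestamp, looks it up in the timestamp column of df1 (for the same city), and fetches the corresponding value as ref_value.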
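To make the intended semantics concrete, here is a minimal sketch of the lookup in plain Python, without Spark. The names `df1_rows`, `df2_rows`, `df3_rows` and `lookup_join` are hypothetical and only for illustration. The sketch assumes each (city, timestamp) pair appears at most once in df1, which holds for the sample data above.

```python
# Sample rows from the question as plain tuples: (city, timestamp, value).
df1_rows = [
    ("a", "2019-01-01 00:00:00", 1),
    ("a", "2018-01-01 00:00:00", 2),
    ("b", "2018-01-01 10:00:00", 1),
    ("b", "2018-01-01 20:00:00", 3),
    ("c", "2019-01-01 10:00:00", 2),
    ("a", "2018-01-01 20:00:00", 5),
    ("c", "2018-01-01 10:00:00", 7),
    ("b", "2017-01-01 20:00:00", 10),
]

# Rows of df2: (city, timestamp, value, ref_timestamp).
df2_rows = [
    ("a", "2019-01-01 00:00:00", 1, "2018-01-01 00:00:00"),
    ("a", "2019-01-01 20:00:00", 2, "2018-01-01 20:00:00"),
    ("b", "2019-01-01 10:00:00", 1, "2018-01-01 10:00:00"),
    ("b", "2018-01-01 20:00:00", 3, "2017-01-01 20:00:00"),
    ("c", "2019-01-01 10:00:00", 2, "2018-01-01 10:00:00"),
]


def lookup_join(left_rows, right_rows):
    """Hypothetical helper: attach ref_value to every row of left_rows.

    Assumes (city, timestamp) is unique in right_rows; a missing key
    yields None, which mirrors a left join.
    """
    # Index df1 by (city, timestamp) so each lookup is a dict access.
    value_by_key = {(city, ts): value for city, ts, value in right_rows}
    return [
        (city, ts, value, ref_ts, value_by_key.get((city, ref_ts)))
        for city, ts, value, ref_ts in left_rows
    ]


df3_rows = lookup_join(df2_rows, df1_rows)
for row in df3_rows:
    print(row)
```

The last element of each printed tuple is the ref_value column of the desired df3. In Spark the same idea is expressed as a join on city and on df2.ref_timestamp = df1.timestamp.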
Answer 0 (score: 0)
df1 = df1.withColumnRenamed("value", "ref_value").withColumnRenamed("timestamp", "ref_timestamp")
df3 = df1.join(df2, ["city", "ref_timestamp"], "leftouter").where("timestamp is not null and value is not null")
Result of df3.show():
|city| ref_timestamp|ref_value| timestamp|value|
+----+-------------------+---------+-------------------+-----+
| a|2018-01-01 00:00:00| 2|2019-01-01 00:00:00| 1|
| a|2018-01-01 20:00:00| 5|2019-01-01 20:00:00| 2|
| b|2018-01-01 10:00:00| 1|2019-01-01 10:00:00| 1|
| b|2017-01-01 20:00:00| 10|2018-01-01 20:00:00| 3|
| c|2018-01-01 10:00:00| 7|2019-01-01 10:00:00| 2|
+----+-------------------+---------+-------------------+-----+
Answer 1 (score: 0)
Joining on city and timestamp should do it:
df3 = df1.join(df2, (df1['city'] == df2['city']) & (df1['timestamp'] == df2['ref_timestamp']))
After that, you only need to rename or drop columns to get the names you want.
Answer 2 (score: 0)
The following worked for me, and it avoids duplicate columns in the result.
df1 = df1.withColumnRenamed('timestamp', 'ref_timestamp').withColumnRenamed('value', 'ref_value')
# df2 already has a ref_timestamp column, so it is left unchanged
df3 = df2.join(df1, ['city', 'ref_timestamp'])