PySpark: add a column whose value comes from another row, using values in the current row as the lookup key

Time: 2019-06-11 23:43:35

Tags: pyspark

I have a DataFrame with the following columns:

DataFrame[timestamp: string, city_id: string, item_id: string, target_value: double, date: date, datestr: string, city_id: string, holiday_name: string, holiday_date: date, reference_date_id: date, hour_of_day: int]

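I want to create a new column named ref_val that takes its value from another row with the same city_id and hexcluster_id, looked up using the date and hour combination from the current row. In other words, ref_val should equal the target_value of the row with the same city_id and hexcluster_id (and the same hour) whose date equals the current row's reference date (ref_date).

For example: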

+-------------------+-------+--------------------+------------+----------+----------+-------+--------------------+------------+-----------------+-----------+-----------+-------+
|          timestamp|city_id|             item_id|target_value|      date|   datestr|city_id|        holiday_name|holiday_date|reference_date_id|hour_of_day|day_of_week|ref_val|
+-------------------+-------+--------------------+------------+----------+----------+-------+--------------------+------------+-----------------+-----------+-----------+-------+
|2018-10-07 11:00:00|     10|0df9c29d-8776-436...|        92.0|2018-10-07|2018-10-07|     10|Columbus Day(shou...|  2018-10-07|       2017-10-08|         11|        Sun|      2|
|2018-10-07 11:00:00|     10|0df9c29d-8776-436...|        92.0|2018-10-07|2018-10-07|     10|Columbus Day(shou...|  2018-10-07|       2017-10-08|         11|        Sun|     92|
+-------------------+-------+--------------------+------------+----------+----------+-------+--------------------+------------+-----------------+-----------+-----------+-------+

0 answers:

No answers