Pyspark: how to apply a value to one dataframe based on the dates in another dataframe

Asked: 2019-10-05 13:29:27

Tags: python dataframe apache-spark pyspark

My first dataframe df contains a start_date and a value; my second dataframe df_v contains only dates.

My df:

+-------------------+-----+
|         start_date|value|
+-------------------+-----+
|2019-03-17 00:00:00|   35|
|2019-05-20 00:00:00|   40|
|2019-06-03 00:00:00|   10|
|2019-07-01 00:00:00|   12|
+-------------------+-----+

My df_v:

+-------------------+
|               date|
+-------------------+
|2019-02-01 00:00:00|
|2019-04-10 00:00:00|
|2019-06-14 00:00:00|
+-------------------+
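
For reference, a minimal sketch that rebuilds both input frames (the timestamp strings and column names come from the tables above; the SparkSession setup is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the tables above
df = spark.createDataFrame(
    [("2019-03-17 00:00:00", 35),
     ("2019-05-20 00:00:00", 40),
     ("2019-06-03 00:00:00", 10),
     ("2019-07-01 00:00:00", 12)],
    ["start_date", "value"],
).withColumn("start_date", to_timestamp("start_date"))

df_v = spark.createDataFrame(
    [("2019-02-01 00:00:00",),
     ("2019-04-10 00:00:00",),
     ("2019-06-14 00:00:00",)],
    ["date"],
).withColumn("date", to_timestamp("date"))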

What I want is a new df_v where v_value is the sum of every value whose start_date falls on or before the given date:

+-------------------+-------------+
|               date|      v_value|
+-------------------+-------------+
|2019-02-01 00:00:00|            0|
|2019-04-10 00:00:00|    (0+35) 35|
|2019-06-14 00:00:00|(35+40+10) 85|
+-------------------+-------------+

I tried to make it work like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn("lead", F.lead(F.col("start_date"), 1).over(Window.orderBy("start_date")))

for r_v in df_v.rdd.collect():
    for r in df.rdd.collect():
        if (r_v.date >= r.start_date) and (r_v.date < r.lead):
            df_v = df_v.withColumn('v_value',
            ...

1 Answer:

Answer 0 (score: 1)

This can be done with a join and an aggregation instead of collecting and looping row by row.

from pyspark.sql import functions as F

# Left join: keep every row of df_v and attach all df rows
# whose start_date is on or before that date
joined_df = df_v.join(df, df.start_date <= df_v.date, 'left')
joined_df.show()  # View the joined result

# Aggregation: sum the matched values per date; unmatched rows
# (null value after the left join) count as 0
joined_df \
    .groupBy(joined_df.date) \
    .agg(F.sum(F.when(joined_df.value.isNull(), 0).otherwise(joined_df.value)).alias('val')) \
    .show()
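
On the sample data this should print something like the following (row order is not guaranteed without an explicit .orderBy('date'); the 2019-02-01 row gets 0 from the when/otherwise fallback):

+-------------------+---+
|               date|val|
+-------------------+---+
|2019-02-01 00:00:00|  0|
|2019-04-10 00:00:00| 35|
|2019-06-14 00:00:00| 85|
+-------------------+---+

One design note: because start_date <= date is a non-equi condition, Spark falls back to a nested-loop style join here; if df is small, passing it as F.broadcast(df) in the join should keep that cheap.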