I have a dataframe df1:
+-------------------+-----+
|         start_date|value|
+-------------------+-----+
|2019-03-17 00:00:00|   35|
|2019-05-20 00:00:00|   40|
|2019-06-03 00:00:00|   10|
|2019-07-01 00:00:00|   12|
+-------------------+-----+
and another dataframe df_date:
+-------------------+
|               date|
+-------------------+
|2019-02-01 00:00:00|
|2019-04-10 00:00:00|
|2019-06-14 00:00:00|
+-------------------+
I did the join, so I now have a df with date, start_date and value, but the values I want should look like this:
+-------------------+-------------------+-----+
|               date|         start_date|value|
+-------------------+-------------------+-----+
|2019-02-01 00:00:00|2019-03-17 00:00:00|    0|
|2019-04-10 00:00:00|2019-05-20 00:00:00|   35|
|2019-06-14 00:00:00|2019-06-03 00:00:00|   85|
+-------------------+-------------------+-----+
Each time I should compare start_date with date: if they are different, I should add the previous value to my value, otherwise I should keep the previous value.
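In other words (going by the expected output above), for each date the value should be the total of all values whose start_date comes before it: 0 for 2019-02-01, 35 for 2019-04-10, and 35 + 40 + 10 = 85 for 2019-06-14.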
I already have the new joined dataframe in PySpark and tried to compute the new values.
I used this code to get the result:
win = Window.partitionBy().orderBy("date")
df = df.withColumn("prev_date", F.lag(F.col("start_date")).over(win))
df = df.fillna({'prev_date': 0})
df = df.withColumn("value",F.when(F.isnull( F.lag(F.col("value"), 1).over(win)),df.value).when(df.start_date != df.prev_date,df.value + F.lag(F.col("value"), 1).over(win)) .otherwise(F.lag(F.col("value"),1).over(win)))
df.show(df.count(),False)
The modifications all happen at the same time, but each time I need the previously computed value.
Thanks
Answer 0 (score: 1)
Here is some code that does what you need.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# step 1: init dataframes
cols = ["start_date", "value"]
data = [["2019-03-17 00:00:00", 35],
        ["2019-05-20 00:00:00", 40],
        ["2019-06-03 00:00:00", 10],
        ["2019-07-01 00:00:00", 12],
        ]
df = spark.createDataFrame(data, cols)
additional_dates = spark.createDataFrame([["2019-02-01 00:00:00"], ["2019-04-10 00:00:00"], ["2019-06-14 00:00:00"]], ["date"])
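# note: start_date and date are created as plain strings here; the "yyyy-MM-dd HH:mm:ss"
# format happens to compare correctly as text, but casting to timestamp is safer in general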
# step 2: calculate the correct values.
# This is done by joining df to the additional dates and summing all values per 'date'
additional_dates = additional_dates.join(df, F.col("date") > F.col("start_date"), "left_outer").fillna(0, subset="value")
additional_dates = additional_dates.groupBy("date").agg(F.sum("value").alias("value"))
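# with the sample data above, additional_dates should now look roughly like this
# (0 for 2019-02-01, 35 for 2019-04-10, and 35 + 40 + 10 = 85 for 2019-06-14):
# +-------------------+-----+
# |               date|value|
# +-------------------+-----+
# |2019-02-01 00:00:00|    0|
# |2019-04-10 00:00:00|   35|
# |2019-06-14 00:00:00|   85|
# +-------------------+-----+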
# at this point you already have 'date' + the correct value; you only need to join back the original start_date column
# step 3: join back the original start_date column
# we do this by joining on the row_number
# note that spark does not have an easy operation for adding a column from another dataframe
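# note: df has one more row (2019-07-01) than additional_dates, so the inner join on
# row_number below simply drops that extra row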
window_df = Window.orderBy("start_date")
window_add = Window.orderBy("date")
df = df.withColumn("row_number", F.row_number().over(window_df))
additional_dates = additional_dates.withColumn("row_number", F.row_number().over(window_add))
df = df.drop("value").join(additional_dates, "row_number").drop("row_number")
df.show()
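With the sample dataframes from step 1, the final df.show() should print something like the following (the 2019-07-01 row of df has no matching row_number in additional_dates and therefore drops out):
+-------------------+-------------------+-----+
|         start_date|               date|value|
+-------------------+-------------------+-----+
|2019-03-17 00:00:00|2019-02-01 00:00:00|    0|
|2019-05-20 00:00:00|2019-04-10 00:00:00|   35|
|2019-06-03 00:00:00|2019-06-14 00:00:00|   85|
+-------------------+-------------------+-----+
One caveat about this approach: a Window.orderBy(...) without a partitionBy pulls all rows into a single partition, which is fine for small dataframes like these but will not scale well to large ones.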