In df1 I have
+--------+----------+----------+
| ID1|start_date| stop_date|
+--------+----------+----------+
|50194077|2012-05-22|2012-05-25|
|50194077|2012-05-19|2012-05-22|
|50194077|2012-06-15|2012-06-18|
|50127135|2016-05-12|2016-05-15|
...
+--------+----------+----------+
and in df2 I have
+----------+-------------------+------------------+
| ID2| date| X|
+----------+-------------------+------------------+
| 50127135|2016-06-10 00:00:00| 24.14699999999999|
| 50127135|2015-08-01 00:00:00|17.864999999999995|
| 50127135|2015-05-10 00:00:00|1.6829999999999998|
| 50127135|2014-07-02 00:00:00| 5.301000000000002|
...
+----------+-------------------+------------------+
I would like to add a column named X_sum to df1 containing the sum of the X values that satisfy ID2 == ID1 and whose date falls between start_date and stop_date.
I tried
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def f(start_date, stop_date, ID, df2):
    # rows of df2 for this ID whose date falls within [start_date, stop_date]
    sub_df2 = df2[df2['date'].between(start_date, stop_date) & (df2.ID2 == ID)]
    return sub_df2.select(F.sum(sub_df2['X'])).collect()[0][0]

udf_f = udf(f, DoubleType())
df1 = df1.withColumn('X_sum',
                     udf_f(df1.start_date, df1.stop_date, df1.ID1, df2))
(and a few other variations), but I don't think pyspark likes that I am trying to pass df2 into the udf.
I am using Python 2.7 and Spark 1.6.
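For what it's worth, what I am after conceptually is something like the join-then-aggregate sketch below. This is only a guess at the approach, assuming the date, start_date and stop_date columns are of comparable types; the joined and result names are just for illustration:

from pyspark.sql import functions as F

# Sketch: join df2 onto df1 on the ID / date-window condition, then sum X
# per (ID1, start_date, stop_date) row; df1 rows with no match keep a null X_sum.
joined = df1.join(
    df2,
    (df1.ID1 == df2.ID2) & df2.date.between(df1.start_date, df1.stop_date),
    'left_outer')

result = (joined
          .groupBy(df1.ID1, df1.start_date, df1.stop_date)
          .agg(F.sum(df2.X).alias('X_sum')))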