The input PySpark dataframe has one row per key_id and date_month. For a given key_id it looks like this:
+--------+-------------+---------+---------+
| key_id | date_month | value_1 | value_2 |
+--------+-------------+---------+---------+
| 1 | 2019-02-01 | 1.135 | 'a' |
| 1 | 2019-03-01 | 0.165 | 'b' |
| 1 | 2019-04-01 | 0.0 | null |
+--------+-------------+---------+---------+
It needs to be resampled to weekly granularity so that it looks like this:
+--------+-------------+---------+---------+
| key_id | date_week | value_1 | value_2 |
+--------+-------------+---------+---------+
| 1 | 2019-02-04 | 1.135 | 'a' |
| 1 | 2019-02-11 | 1.135 | 'a' |
| 1 | 2019-02-18 | 1.135 | 'a' |
| 1 | 2019-02-25 | 1.135 | 'a' |
| 1 | 2019-03-04 | 0.165 | 'b' |
| 1 | 2019-03-11 | 0.165 | 'b' |
| 1 | 2019-03-18 | 0.165 | 'b' |
| 1 | 2019-03-25 | 0.165 | 'b' |
| 1 | 2019-04-01 | 0.0 | null |
| 1 | 2019-04-08 | 0.0 | null |
| 1 | 2019-04-15 | 0.0 | null |
| 1 | 2019-04-22 | 0.0 | null |
| 1 | 2019-04-29 | 0.0 | null |
+--------+-------------+---------+---------+
The code I currently have switches between PySpark dataframes and Pandas and runs to roughly 30 lines: date ranges, joins, and so on.
Is there a simple way to do this in PySpark?
I have tried Pandas resampling from months to weeks, but I cannot figure out how to make it work when my "primary key" is the combination of date_month and key_id.
At the moment the initial dataframe only has about 250K rows, so I suspect converting the PySpark dataframe with toPandas() and doing the transformation in Pandas would also be a workable option.
Answer 0 (score: 0)
The solution below builds a mapper from months to weeks (where the weeks are the Mondays falling within each month) and joins it onto the original data.
The boring part, mocking up the data:
## Replicate data with join trick to get out nulls
## Convert string to date format
import pyspark.sql.functions as F
c = ['key_id','date_month','value_1']
d = [(1,'2019-02-01',1.135),
     (1,'2019-03-01',0.165),
     (1,'2019-04-01',0.0)]
c2 = ['date_month','value_2']
d2 = [('2019-02-01','a'),
      ('2019-03-01','b')]
df = spark.createDataFrame(d,c)
df2 = spark.createDataFrame(d2,c2)
test_df = df.join(df2, how = 'left', on = 'date_month')
test_df_date = test_df.withColumn('date_month', F.to_date(test_df['date_month']))
test_df_date.orderBy('date_month').show()
Your data:
+----------+------+-------+-------+
|date_month|key_id|value_1|value_2|
+----------+------+-------+-------+
|2019-02-01| 1| 1.135| a|
|2019-03-01| 1| 0.165| b|
|2019-04-01| 1| 0.0| null|
+----------+------+-------+-------+
Build the month-to-week mapper, i.e. a mapping from each month to the Mondays that fall within it, using the neat trick from get all the dates between two dates in Spark DataFrame. (You could apply this directly to the original data without building a separate mapper.)
## Build month to week mapper
## Get first and last of each month, and number of days between
months = test_df_date.select('date_month').distinct()
months = months.withColumn('date_month_end', F.last_day(F.col('date_month')))
months = months.withColumn('days', F.datediff(F.col('date_month_end'),
                                              F.col('date_month')))
## Use trick from https://stackoverflow.com/questions/51745007/get-all-the-dates-between-two-dates-in-spark-dataframe
## Adds a column 'day_in_month' with all days in the month from first to last.
months = months.withColumn("repeat", F.expr("split(repeat(',', days), ',')"))\
    .select("*", F.posexplode("repeat").alias("day_in_month", "val"))\
    .drop("repeat", "val", "days")\
    .withColumn("day_in_month", F.expr("date_add(date_month, day_in_month)"))
## Add integer day of week value - Sunday == 1, Monday == 2,
## Filter by mondays,
## Rename and drop columns
months = months.withColumn('day', F.dayofweek(F.col('day_in_month')))
months = months.filter(F.col('day') == 2)
month_week_mapper = months.withColumnRenamed('day_in_month', 'date_week')\
    .drop('day', 'date_month_end')
month_week_mapper.orderBy('date_week').show()
The mapper looks like this:
+----------+----------+
|date_month| date_week|
+----------+----------+
|2019-02-01|2019-02-04|
|2019-02-01|2019-02-11|
|2019-02-01|2019-02-18|
|2019-02-01|2019-02-25|
|2019-03-01|2019-03-04|
|2019-03-01|2019-03-11|
|2019-03-01|2019-03-18|
|2019-03-01|2019-03-25|
|2019-04-01|2019-04-01|
|2019-04-01|2019-04-08|
|2019-04-01|2019-04-15|
|2019-04-01|2019-04-22|
|2019-04-01|2019-04-29|
+----------+----------+
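As an aside (not part of the original answer): on Spark 2.4 or later the same mapper can be built a little more directly with the sequence() SQL function and explode(), roughly like this sketch:
## Alternative mapper using sequence() + explode() (requires Spark 2.4+)
months_alt = test_df_date.select('date_month').distinct()\
    .withColumn('date_month_end', F.last_day(F.col('date_month')))\
    .withColumn('day_in_month',
                F.explode(F.expr("sequence(date_month, date_month_end, interval 1 day)")))\
    .filter(F.dayofweek('day_in_month') == 2)\
    .select('date_month', F.col('day_in_month').alias('date_week'))
months_alt.orderBy('date_week').show()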
Then we left-join the mapper onto the original data, so that each month is joined to each of its weeks. The last lines simply drop the extra column and reorder the rows/columns to match the output specified above.
## Perform the join, and do some cleanup to get results into order/format specified above.
out_df = test_df_date.join(month_week_mapper, on = 'date_month', how = 'left')
out_df.drop('date_month')\
    .select('key_id','date_week','value_1','value_2')\
    .orderBy('date_week')\
    .show()
## Gives me an output of:
+------+----------+-------+-------+
|key_id| date_week|value_1|value_2|
+------+----------+-------+-------+
| 1|2019-02-04| 1.135| a|
| 1|2019-02-11| 1.135| a|
| 1|2019-02-18| 1.135| a|
| 1|2019-02-25| 1.135| a|
| 1|2019-03-04| 0.165| b|
| 1|2019-03-11| 0.165| b|
| 1|2019-03-18| 0.165| b|
| 1|2019-03-25| 0.165| b|
| 1|2019-04-01| 0.0| null|
| 1|2019-04-08| 0.0| null|
| 1|2019-04-15| 0.0| null|
| 1|2019-04-22| 0.0| null|
| 1|2019-04-29| 0.0| null|
+------+----------+-------+-------+
This should work with your key_id column as well, although you would want to test it on slightly different data to be sure.
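For instance, a quick sanity check with a second, made-up key_id (hypothetical data, not from the question) to confirm the join fans out per key:
## Hypothetical extra key to check that each (key_id, date_month) fans out to its weeks
extra = spark.createDataFrame(
    [(2, '2019-02-01', 9.9, 'c')],
    ['key_id', 'date_month', 'value_1', 'value_2']
).withColumn('date_month', F.to_date('date_month'))

test_df_date.unionByName(extra)\
    .join(month_week_mapper, on='date_month', how='left')\
    .select('key_id', 'date_week', 'value_1', 'value_2')\
    .orderBy('key_id', 'date_week')\
    .show()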
I would definitely advocate doing it as above rather than converting to Pandas and back. df.toPandas() is quite slow, and if the data size grows over time the Pandas approach will at some point fail, and you (or whoever ends up maintaining the code) will run into that problem.