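I have a problem similar to the following question:
Combine Date Ranges in Pandas Dataframe
However, I am working with a huge dataset, so I would like to know whether the same thing can be done in PySpark instead of pandas. The pandas solution is shown below. Can it be done in PySpark?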
def merge_dates(grp):
    # Find contiguous date groups, and get the first/last start/end date for each group.
    dt_groups = (grp['StartDate'] != grp['EndDate'].shift()).cumsum()
    return grp.groupby(dt_groups).agg({'StartDate': 'first', 'EndDate': 'last'})

# Perform a groupby and apply the merge_dates function, followed by formatting.
df = df.groupby(['FruitID', 'FruitType']).apply(merge_dates)
df = df.reset_index().drop('level_2', axis=1)
Answer 0 (score: 1)
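We can use Window and the lag function to identify the contiguous groups, and then aggregate them in much the same way as the pandas function you shared. A working example is given below; I hope it helps!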
import pandas as pd
from dateutil.parser import parse
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Reuse the active session or start one (older shells expose `sqlContext` instead).
spark = SparkSession.builder.getOrCreate()

# EXAMPLE DATA -----------------------------------------------
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order.
pdf = pd.DataFrame({
    'FruitID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'FruitType': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Banana', 'Banana', 'Blueberry', 'Mango', 'Kiwi', 'Mango'],
    'StartDate': [parse(x) for x in ['2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-05-31',
                                     '2017-01-01', '2015-01-01', '2016-01-01', '2017-01-01', '2015-01-01', '2016-09-15', '2017-01-01']],
    'EndDate': [parse(x) for x in ['2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01',
                                   '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2016-01-01', '2017-01-01', '2018-01-01']],
})
# sort_values returns a new frame, so the result has to be assigned.
pdf = pdf.sort_values(['FruitID', 'StartDate'])
df = spark.createDataFrame(pdf)
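# FIND CONTIGUOUS GROUPS AND AGGREGATE ---------------------
# Partitioned by FruitType only; add "FruitID" to partitionBy and groupBy
# to mirror the pandas groupby(['FruitID', 'FruitType']).
w = Window.partitionBy("FruitType").orderBy("StartDate")
# 1 when a row does not start where the previous row ended, else 0.
# The first row of a partition has a NULL lag, so it falls through to 0.
contiguous = F.when(
    F.datediff(F.lag("EndDate", 1).over(w), F.col("StartDate")) != 0, F.lit(1)
).otherwise(F.lit(0))
df = (df
      # The running sum of the break flags labels each contiguous run.
      .withColumn('contiguous_grp', F.sum(contiguous).over(w))
      .groupBy('FruitType', 'contiguous_grp')
      # min/max rather than first/last, whose result depends on row order after the shuffle.
      .agg(F.min('StartDate').alias('StartDate'), F.max('EndDate').alias('EndDate'))
      .drop('contiguous_grp'))
df.show()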
Output:
+---------+-------------------+-------------------+
|FruitType|          StartDate|            EndDate|
+---------+-------------------+-------------------+
|   Orange|2015-01-01 00:00:00|2016-01-01 00:00:00|
|   Orange|2016-05-31 00:00:00|2018-01-01 00:00:00|
|   Banana|2015-01-01 00:00:00|2017-01-01 00:00:00|
|     Kiwi|2016-09-15 00:00:00|2017-01-01 00:00:00|
|    Mango|2015-01-01 00:00:00|2016-01-01 00:00:00|
|    Mango|2017-01-01 00:00:00|2018-01-01 00:00:00|
|    Apple|2015-01-01 00:00:00|2018-01-01 00:00:00|
|Blueberry|2017-01-01 00:00:00|2018-01-01 00:00:00|
+---------+-------------------+-------------------+
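
If it helps to see the logic without Spark, here is a minimal plain-Python sketch of what the lag and running-sum steps compute within a single partition. The helper name merge_contiguous is made up for illustration and is not part of pandas or PySpark. Like the approach above, it assumes a range continues the previous one only when its start exactly equals the previous end, so it does not try to handle overlapping ranges.

```python
from datetime import date


def merge_contiguous(ranges):
    """Merge (start, end) pairs where each start equals the previous end.

    Hypothetical helper for illustration; it mimics one window partition.
    """
    merged = {}        # group label -> [start, end]
    group = 0          # running sum of break flags
    prev_end = None    # plays the role of lag("EndDate")
    for start, end in sorted(ranges):  # plays the role of orderBy("StartDate")
        # Flag a break only when a previous row exists and its end differs
        # from this start; the first row never opens a new group.
        if prev_end is not None and prev_end != start:
            group += 1
        if group not in merged:
            merged[group] = [start, end]
        else:
            merged[group][1] = max(merged[group][1], end)
        prev_end = end
    return [tuple(bounds) for bounds in merged.values()]


# The three Orange rows from the example data.
orange = [
    (date(2015, 1, 1), date(2016, 1, 1)),
    (date(2016, 5, 31), date(2017, 1, 1)),
    (date(2017, 1, 1), date(2018, 1, 1)),
]
print(merge_contiguous(orange))
```

For the Orange rows this yields two ranges, 2015-01-01 to 2016-01-01 and 2016-05-31 to 2018-01-01, which agrees with the two Orange rows in the output table above.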