Question

I have a DataFrame in the following format...

id, name  , start_date, end_date  , active
1 , albert, 2019-08-14, 3499-12-31, 1
1 , albert, 2019-08-13, 2019-08-14, 0
1 , albert, 2019-06-26, 2019-08-13, 0
1 , brian , 2018-01-17, 2019-06-26, 0
1 , brian , 2017-07-31, 2018-01-17, 0
1 , albert, 2017-03-31, 2018-07-31, 0
2 , diane , 2019-07-14, 3499-12-31, 1
2 , diane , 2019-06-13, 2019-07-14, 0
2 , ethel , 2019-03-20, 2019-06-13, 0
2 , ethel , 2018-01-17, 2019-03-20, 0
2 , frank , 2017-07-31, 2018-01-17, 0
2 , frank , 2015-03-21, 2018-07-31, 0

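I want to merge consecutive rows whose name is the same as the previous row's, while keeping the correct start and end dates in the final output DataFrame. So the correct output would be...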

id, name  , start_date, end_date  , active
1 , albert, 2019-06-26, 3499-12-31, 1
1 , brian , 2017-07-31, 2019-06-26, 0
1 , albert, 2017-03-31, 2018-07-31, 0
2 , diane , 2019-06-13, 3499-12-31, 1
2 , ethel , 2018-01-17, 2019-06-13, 0
2 , frank , 2017-03-31, 2018-01-17, 0

The number of entries per id varies, as does the number of distinct names per id.

How can I do this in PySpark? Thanks.

Answer 1

Are you looking for df.groupby(["name", "start_date", "end_date"]).sum("active")?

If I have understood your question correctly, the code above will do the job.

Answer 2

So, after a bit of thinking, I figured out how to do this. There may be a better way, but this works.

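First create a window partitioned by id and ordered by start_date (descending), then use lag to capture the name from the adjacent row, which is the chronologically next record.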

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when
frame = Window.partitionBy('id').orderBy(col('start_date').desc())
df = df.select('*', lag(col('name')).over(frame).alias('next_name'))

Then, if the current row's name matches next_name, set 0; otherwise set 1...

df = df.withColumn('countrr', when(col('name') == col('next_name'), 0).otherwise(1))
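
The flagging step can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming the rows for one id are already sorted by start_date descending; the helper name change_flags is hypothetical and only mimics what the lag comparison produces.

```python
# Names for id 1, in the same order as the window (start_date descending).
names = ["albert", "albert", "albert", "brian", "brian", "albert"]

def change_flags(values):
    """Return 1 where a value starts a new run, 0 where it repeats its neighbour.

    The first element has nothing before it, so it is always flagged 1,
    just as a null lag value fails the equality check in the Spark version.
    """
    flags = []
    for i, value in enumerate(values):
        is_new_run = i == 0 or value != values[i - 1]
        flags.append(1 if is_new_run else 0)
    return flags

countrr = change_flags(names)
print(countrr)  # [1, 0, 0, 1, 0, 1]
```

Each 1 marks the first row of a run of identical names; the second albert run gets its own 1 because brian sits in between.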

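Next, extend the window with a frame that covers the rows between the start of the partition and the current row, and sum the countrr column over that frame...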

from pyspark.sql import functions as F
frame2 = Window.partitionBy('id').orderBy(col('start_date').desc()).rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('sumrr', F.sum('countrr').over(frame2))
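
As a rough illustration of what the windowed sum yields, here is a plain-Python sketch using itertools.accumulate on the flags for id 1. The input list is assumed to be the 0/1 change flags in window order; it is not produced by Spark here.

```python
from itertools import accumulate

# Change flags for id 1 in window order: 1 starts a new run of names.
countrr = [1, 0, 0, 1, 0, 1]

# A cumulative sum from the first row up to the current row,
# the same shape of calculation as unboundedPreceding..currentRow.
sumrr = list(accumulate(countrr))
print(sumrr)  # [1, 1, 1, 2, 2, 3]
```

Rows in the same run share a value, and the two separate albert runs end up with different values (1 and 3), which is why grouping on name alone would not be enough.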

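This effectively creates a column that increases by one every time the name changes. Finally, you can group by this new sumrr column together with the other columns and take the minimum and maximum dates as needed...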

gb_df = df.groupby(['id', 'name', 'sumrr'])
result = gb_df.agg({'start_date':'min', 'end_date':'max'})
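
The grouping step can be checked by hand with a small plain-Python sketch. It assumes each row already carries its sumrr value and that dates are ISO-formatted strings (so string comparison orders them correctly); the function name merge_runs is hypothetical.

```python
# (id, name, start_date, end_date, sumrr) for id 1, in window order.
rows = [
    (1, "albert", "2019-08-14", "3499-12-31", 1),
    (1, "albert", "2019-08-13", "2019-08-14", 1),
    (1, "albert", "2019-06-26", "2019-08-13", 1),
    (1, "brian", "2018-01-17", "2019-06-26", 2),
    (1, "brian", "2017-07-31", "2018-01-17", 2),
    (1, "albert", "2017-03-31", "2018-07-31", 3),
]

def merge_runs(rows):
    """Group by (id, name, sumrr), keeping min start_date and max end_date."""
    merged = {}
    for id_, name, start, end, run in rows:
        key = (id_, name, run)
        if key in merged:
            lo, hi = merged[key]
            merged[key] = (min(lo, start), max(hi, end))
        else:
            merged[key] = (start, end)
    # Dicts keep insertion order, so runs come out in window order.
    return [(k[0], k[1], s, e) for k, (s, e) in merged.items()]

result = merge_runs(rows)
for row in result:
    print(row)
```

For this sample the three runs collapse to one row each, matching the id 1 rows of the expected output in the question (without the active flag).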

Then you have to join the active flag back on, using id, name and end date as the join keys.
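
A plain-Python sketch of that re-join, assuming the merged rows keep the maximum end date of each run (so each merged end date also appears on one original row). The variable names here are illustrative, not part of the answer's code.

```python
# Original rows for id 1: (id, name, start_date, end_date, active).
original = [
    (1, "albert", "2019-08-14", "3499-12-31", 1),
    (1, "albert", "2019-08-13", "2019-08-14", 0),
    (1, "albert", "2019-06-26", "2019-08-13", 0),
    (1, "brian", "2018-01-17", "2019-06-26", 0),
    (1, "brian", "2017-07-31", "2018-01-17", 0),
    (1, "albert", "2017-03-31", "2018-07-31", 0),
]

# Merged runs without the flag: (id, name, start_date, end_date).
merged = [
    (1, "albert", "2019-06-26", "3499-12-31"),
    (1, "brian", "2017-07-31", "2019-06-26"),
    (1, "albert", "2017-03-31", "2018-07-31"),
]

# Look up the flag by (id, name, end_date), the same keys as the join.
active_by_key = {(i, n, e): a for i, n, _s, e, a in original}
final = [(i, n, s, e, active_by_key[(i, n, e)]) for i, n, s, e in merged]
for row in final:
    print(row)
```

The lookup works because the latest row of each run supplies both the merged end date and the flag that should survive.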

This gives the correct output...

PySpark - merge consecutive duplicate rows but keep the start and end dates
