I have a sample DataFrame with an "id" column and a "time" column, and I want to derive 3 new columns.
I think I have the first two worked out; I need help with the third. For item 3, the value for the corresponding "id" should look like this:
3@ (11)
My code sample:
from pyspark.sql import functions as F, Window
df = (sc.parallelize([
{ "id":"1@" ,"time":"2018-09-13" },
{ "id":"1@" ,"time":"2018-09-14" },
{ "id":"2@" ,"time":"2018-10-17" },
{ "id":"2@" ,"time":"2018-10-18" },
{ "id":"2@" ,"time":"2018-10-19" },
{ "id":"2@" ,"time":"2018-10-20" },
{ "id":"2@" ,"time":"2018-10-21" },
{ "id":"2@" ,"time":"2018-10-22" },
{ "id":"2@" ,"time":"2018-10-23" },
{ "id":"3@" ,"time":"2018-11-09" },
{ "id":"3@" ,"time":"2018-11-10" },
{ "id":"3@" ,"time":"2018-11-11" },
{ "id":"3@" ,"time":"2018-11-12" },
{ "id":"3@" ,"time":"2018-11-13" },
{ "id":"3@" ,"time":"2018-11-14" },
{ "id":"3@" ,"time":"2018-11-15" },
{ "id":"3@" ,"time":"2018-11-16" },
{ "id":"3@" ,"time":"2018-11-17" },
{ "id":"3@" ,"time":"2018-11-18" },
{ "id":"3@" ,"time":"2018-11-19" }
]).toDF()
.cache()
)
(
    df
    # first time each id appears
    .withColumn('min', F.min('time').over(Window.partitionBy(F.col('id'))))
    # number of distinct times per id
    .withColumn('group_size', F.size(F.collect_set('time').over(Window.partitionBy(F.col('id')))))
    # number of distinct times in the whole DataFrame
    .withColumn('overall_size', F.size(F.collect_set('time').over(Window.partitionBy())))
    # my attempt at column 3: distinct times on or after the id's first appearance
    .withColumn(
        'overall_size_from_first_group_appearance',
        F.size(
            F.collect_set(
                F.when(
                    F.min('time').over(Window.partitionBy(F.col('id'))) <= F.col('time'),
                    F.col('time')
                )
            ).over(Window.partitionBy())
        )
    )
    .orderBy(F.col('time').asc())
    .show(truncate=False)
)
That last column is the one I need help with.
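To pin down what I am after: for each id, I want the number of distinct time values anywhere in the DataFrame that fall on or after that id's earliest time. A minimal join-based sketch of that definition (the helper names first_seen and distinct_times are only illustrative, and it assumes the yyyy-MM-dd strings sort in date order):
# For each id, count the distinct times in the whole DataFrame that are >= that
# id's first time; string comparison is safe for yyyy-MM-dd values.
first_seen = df.groupBy('id').agg(F.min('time').alias('first_time'))
distinct_times = df.select('time').distinct()
expected = (
    first_seen
    .join(distinct_times, distinct_times['time'] >= first_seen['first_time'])
    .groupBy('id')
    .agg(F.countDistinct('time').alias('overall_size_from_first_group_appearance'))
)
expected.show()  # expecting 1@ -> 20, 2@ -> 18, 3@ -> 11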
Answer 0 (score: 0):
Here is my approach to this problem:
# Rank all rows by time, newest first: a row's rank is the number of rows
# whose time is on or after its own.
df = df.withColumn('rank', F.row_number().over(Window.orderBy(F.col('time').desc())))
# Within each id, the earliest row carries the group's largest rank, i.e. the
# count of rows dated on or after the group's first appearance.
df = df.withColumn('overall_size_from_first_group_appearance',
                   F.max('rank').over(Window.partitionBy('id').orderBy('time')))
df.show()
The output looks like this:
+---+----------+----+----------------------------------------+
| id| time|rank|overall_size_from_first_group_appearance|
+---+----------+----+----------------------------------------+
| 1@|2018-09-13| 20| 20|
| 1@|2018-09-14| 19| 20|
| 2@|2018-10-17| 18| 18|
| 2@|2018-10-18| 17| 18|
| 2@|2018-10-19| 16| 18|
| 2@|2018-10-20| 15| 18|
| 2@|2018-10-21| 14| 18|
| 2@|2018-10-22| 13| 18|
| 2@|2018-10-23| 12| 18|
| 3@|2018-11-09| 11| 11|
| 3@|2018-11-10| 10| 11|
| 3@|2018-11-11| 9| 11|
| 3@|2018-11-12| 8| 11|
| 3@|2018-11-13| 7| 11|
| 3@|2018-11-14| 6| 11|
| 3@|2018-11-15| 5| 11|
| 3@|2018-11-16| 4| 11|
| 3@|2018-11-17| 3| 11|
| 3@|2018-11-18| 2| 11|
| 3@|2018-11-19| 1| 11|
+---+----------+----+----------------------------------------+
Hope that helps!
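One caveat worth hedging: row_number counts rows, while the question's other columns use collect_set, i.e. distinct times. The two agree here only because every time value is unique; with duplicate times, a distinct-safe variant would rank the distinct times first (a sketch with illustrative names; note also that Window.orderBy without partitionBy moves all data to a single partition, which Spark warns about on larger datasets):
# Rank the distinct times, newest first, then attach each group's largest
# rank: that equals the number of distinct times on or after the id's first.
distinct_ranked = (df.select('time').distinct()
    .withColumn('rank_d', F.row_number().over(Window.orderBy(F.col('time').desc()))))
result = (df.join(distinct_ranked, on='time')
    .withColumn('overall_size_from_first_group_appearance',
                F.max('rank_d').over(Window.partitionBy('id'))))
result.show()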