I sorted the following DataFrame by the two columns id and Updated_date.

Initial DataFrame:
|id|date |Updated_date |
|a |2019-02-14|2018-10-30 10:25:45|
|a |2019-02-14|2018-11-28 10:51:34|
|a |2019-01-11|2018-11-29 10:46:07|
|a |2019-01-14|2018-11-30 10:42:56|
|a |2019-01-16|2018-12-01 10:28:46|
|a |2019-01-22|2018-12-02 10:22:06|
|b |2019-01-25|2018-11-15 10:36:59|
|b |2019-02-10|2018-11-16 10:58:01|
|b |2019-02-04|2018-11-17 10:42:12|
|b |2019-02-10|2018-11-24 10:24:56|
|b |2019-02-02|2018-12-01 10:28:46|
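For reference, the sort described above is a one-liner (a minimal sketch, assuming the DataFrame variable is named df):

# Sort by id, then by Updated_date (ascending).
df = df.orderBy("id", "Updated_date")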
I want to create two new columns LB and UB such that: for each id, the first values of LB and UB are the bounds of the interval (date +/- 10 days); for each following row with the same id, we check whether its date falls between the previous row's LB and UB. If it does, we keep the same values; otherwise we recompute a new interval (+/- 10 days).

My expected output:
How can I iterate over the rows within each group?
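To make the rule concrete, here is a plain-Python sketch of the intended per-group logic (illustrative only; rows are assumed to be pre-sorted by id and Updated_date, with dates as datetime.date values):

from datetime import timedelta

def compute_bounds(rows):
    # rows: (id, date) pairs, already sorted by id and Updated_date.
    out = []
    prev_id = lb = ub = None
    for row_id, d in rows:
        # Open a new interval on the first row of an id, or when the date
        # falls outside the previous row's [LB, UB] interval.
        if row_id != prev_id or not (lb <= d <= ub):
            lb, ub = d - timedelta(days=10), d + timedelta(days=10)
        out.append((row_id, d, lb, ub))
        prev_id = row_id
    return out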
Answer 0: (score: 0)
If you are fine with using a UDF:
from datetime import timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

SS = SparkSession.builder.getOrCreate()

data = [{"id": "a", "date": "2019-02-14", "updated_date": "2018-10-30 10:25:45"},
{"id": "a", "date": "2019-02-14", "updated_date": "2018-11-28 10:51:34"},
{"id": "a", "date": "2019-01-11", "updated_date": "2018-11-29 10:46:07"},
{"id": "a", "date": "2019-01-14", "updated_date": "2018-11-30 10:42:56"},
{"id": "a", "date": "2019-01-16", "updated_date": "2018-12-01 10:28:46"},
{"id": "a", "date": "2019-01-22", "updated_date": "2018-12-02 10:22:06"},
{"id": "b", "date": "2019-01-25", "updated_date": "2018-11-15 10:36:59"},
{"id": "b", "date": "2019-02-10", "updated_date": "2018-11-16 10:58:01"},
{"id": "b", "date": "2019-02-04", "updated_date": "2018-11-17 10:42:12"},
{"id": "b", "date": "2019-02-10", "updated_date": "2018-11-24 10:24:56"},
{"id": "b", "date": "2019-02-02", "updated_date": "2018-12-01 10:28:46"}]
# Target types for each column; used below to cast the DataFrame columns.
schema = {
    "fields": [
        {"metadata": {}, "name": "date", "nullable": False, "type": "date"},
        {"metadata": {}, "name": "id", "nullable": False, "type": "string"},
        {"metadata": {}, "name": "updated_date", "nullable": False, "type": "timestamp"},
    ]
}
# UDFs that shift a date by +/- 10 days to build the upper and lower bounds.
@udf("date")
def increment(cell):
    return cell + timedelta(days=10)


@udf("date")
def decrease(cell):
    return cell - timedelta(days=10)


df = SS.createDataFrame(data)

# Cast each column to the type declared in the schema dict above.
for field in schema["fields"]:
    df = df.withColumn(field["name"], df[field["name"]].cast(field["type"]))

df.show()
df.printSchema()

# UB = date + 10 days, LB = date - 10 days, computed independently per row.
df = df.withColumn("UB", increment("date"))
df = df.withColumn("LB", decrease("date"))

df.show()
df.printSchema()
My output:
# After applying the schema
+----------+---+-------------------+
| date| id| updated_date|
+----------+---+-------------------+
|2019-02-14| a|2018-10-30 10:25:45|
|2019-02-14| a|2018-11-28 10:51:34|
|2019-01-11| a|2018-11-29 10:46:07|
|2019-01-14| a|2018-11-30 10:42:56|
|2019-01-16| a|2018-12-01 10:28:46|
|2019-01-22| a|2018-12-02 10:22:06|
|2019-01-25| b|2018-11-15 10:36:59|
|2019-02-10| b|2018-11-16 10:58:01|
|2019-02-04| b|2018-11-17 10:42:12|
|2019-02-10| b|2018-11-24 10:24:56|
|2019-02-02| b|2018-12-01 10:28:46|
+----------+---+-------------------+
root
|-- date: date (nullable = true)
|-- id: string (nullable = true)
|-- updated_date: timestamp (nullable = true)
+----------+---+-------------------+----------+----------+
| date| id| updated_date| UB| LB|
+----------+---+-------------------+----------+----------+
|2019-02-14| a|2018-10-30 10:25:45|2019-02-24|2019-02-04|
|2019-02-14| a|2018-11-28 10:51:34|2019-02-24|2019-02-04|
|2019-01-11| a|2018-11-29 10:46:07|2019-01-21|2019-01-01|
|2019-01-14| a|2018-11-30 10:42:56|2019-01-24|2019-01-04|
|2019-01-16| a|2018-12-01 10:28:46|2019-01-26|2019-01-06|
|2019-01-22| a|2018-12-02 10:22:06|2019-02-01|2019-01-12|
|2019-01-25| b|2018-11-15 10:36:59|2019-02-04|2019-01-15|
|2019-02-10| b|2018-11-16 10:58:01|2019-02-20|2019-01-31|
|2019-02-04| b|2018-11-17 10:42:12|2019-02-14|2019-01-25|
|2019-02-10| b|2018-11-24 10:24:56|2019-02-20|2019-01-31|
|2019-02-02| b|2018-12-01 10:28:46|2019-02-12|2019-01-23|
+----------+---+-------------------+----------+----------+
root
|-- date: date (nullable = true)
|-- id: string (nullable = true)
|-- updated_date: timestamp (nullable = true)
|-- UB: date (nullable = true)
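Note that the UDFs above compute date +/- 10 days independently for every row; they do not carry an interval forward within a group. One way to actually iterate over the rows of each group is groupBy(...).applyInPandas, available in Spark 3.0+ when pandas and PyArrow are installed. A rough, untested sketch (column names as in the answer's DataFrame):

from datetime import timedelta
import pandas as pd

def add_bounds(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows for a single id; process them in updated_date order.
    pdf = pdf.sort_values("updated_date")
    lb = ub = None
    lbs, ubs = [], []
    for d in pdf["date"]:  # DateType values arrive as datetime.date objects
        # Open a new +/- 10 day interval on the first row of the group, or
        # whenever the current date falls outside the previous interval.
        if lb is None or not (lb <= d <= ub):
            lb, ub = d - timedelta(days=10), d + timedelta(days=10)
        lbs.append(lb)
        ubs.append(ub)
    return pdf.assign(LB=lbs, UB=ubs)

result = df.groupBy("id").applyInPandas(
    add_bounds,
    schema="`date` date, id string, updated_date timestamp, LB date, UB date",
)
result.show()

The +/- 10 day arithmetic alone could also be done without a UDF via the built-in functions date_add and date_sub, but carrying the interval forward within a group is what forces a sequential pass like the one above.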