I have a table like this:
+--------------------+--------------------+-------------------+
| ID| point| timestamp|
+--------------------+--------------------+-------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|
+--------------------+--------------------+-------------------+
I want to add a column point1 whose values are the same as those of column point, but shifted up by one row, with the last value equal to 0:
+--------------------+--------------------+-------------------+---------+---------+------+
| ID| point| timestamp| lon| lat|point1|
+--------------------+--------------------+-------------------+---------+---------+------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991...|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|-73.26599|40.851482|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|-73.27145|40.853184|POINT (-73.265609...|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|-73.26561|40.854164| 0|
Answer 0 (score: 0)
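Use the window function lead to build the point1 column from the point column: for each row it returns point from the next row in the window. Where lead would return null (the last row has no next row), replace it with 0.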
df.show()
#+-----------------+-----------------+-------------------+---------+---------+
#| ID| point| timestamp| lon| lat|
#+-----------------+-----------------+-------------------+---------+---------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|
#+-----------------+-----------------+-------------------+---------+---------+
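from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lit
# Ordering by a constant gives no guaranteed row order; replace lit("1") with a real column (e.g. "timestamp") if you need a specific order
w = Window.partitionBy('ID').orderBy(lit("1"))
# lead("point", 1, 0) returns the next row's point within the window, or the default 0 when there is no next row
df.withColumn("point1", lead("point", 1, 0).over(w)).show()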
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#| ID| point| timestamp| lon| lat| point1|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482| 0|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
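
To make the shift concrete, here is a rough plain-Python sketch of what lead computes inside a single partition, with no Spark involved. The helper name lead_values and the sample strings are hypothetical and only illustrate the idea; the sketch assumes a non-negative offset and a list that is already in the order you care about.

```python
from typing import Any, List, Sequence


def lead_values(values: Sequence[Any], offset: int = 1, default: Any = None) -> List[Any]:
    """For each position i, return values[i + offset], or `default` past the end.

    Hypothetical helper that mimics the behaviour of a window lead over
    one already-ordered partition. Assumes offset >= 0.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]


# Illustrative data only, not taken from the real table.
points = ["POINT A", "POINT B", "POINT C"]
point1 = lead_values(points, offset=1, default="0")

for current, following in zip(points, point1):
    print(current, "->", following)
```

As in the Spark version, only the last row of the partition falls back to the default, which is why the result depends entirely on the order of the rows.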