How do I add a column with specific values to a PySpark SQL DataFrame?

Date: 2020-03-28 13:05:43

Tags: python sql pyspark pyspark-sql

I have a table like this:

+--------------------+--------------------+-------------------+
|                  ID|               point|          timestamp|
+--------------------+--------------------+-------------------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|
+--------------------+--------------------+-------------------+

I want to add a column point1 that holds the same values as point, but shifted up by one row, so each row carries the next row's point and the last row gets 0:

+--------------------+--------------------+-------------------+---------+---------+------+
|                  ID|               point|          timestamp|      lon|      lat|point1|
+--------------------+--------------------+-------------------+---------+---------+------+
|679ac975acc4bdec9...|POINT (-73.267631...|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991...|
|679ac975acc4bdec9...|POINT (-73.265991...|2020-01-01 17:10:40|-73.26599|40.851482|POINT (-73.271446...|
|679ac975acc4bdec9...|POINT (-73.271446...|2020-01-01 02:54:15|-73.27145|40.853184|POINT (-73.265609...|
|679ac975acc4bdec9...|POINT (-73.265609...|2020-01-01 17:10:24|-73.26561|40.854164|     0|
+--------------------+--------------------+-------------------+---------+---------+------+

1 Answer:

Answer 0 (score: 0)

Use the window lead function over the point column to generate the point1 data:

  • if lead returns null (i.e. for the last row of a partition), replace it with the default 0
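To make the snippet below directly runnable, here is a minimal sketch that rebuilds the sample frame (assuming a local SparkSession; the ID and point values are the abbreviated placeholders from the question, not real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Abbreviated rows from the question; the real IDs and WKT points are longer.
df = spark.createDataFrame(
    [("679ac975acc4bdec9", "POINT (-73.267631", "2020-01-01 17:10:49", -73.26763, 40.850548),
     ("679ac975acc4bdec9", "POINT (-73.271446", "2020-01-01 02:12:31", -73.27145, 40.85318),
     ("679ac975acc4bdec9", "POINT (-73.265991", "2020-01-01 17:10:40", -73.26599, 40.851482)],
    ["ID", "point", "timestamp", "lon", "lat"])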

df.show()
#+-----------------+-----------------+-------------------+---------+---------+
#|               ID|            point|          timestamp|      lon|      lat|
#+-----------------+-----------------+-------------------+---------+---------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|
#+-----------------+-----------------+-------------------+---------+---------+

from pyspark.sql.window import Window
from pyspark.sql.functions import lead, lit

# Ordering by a constant literal keeps the rows in whatever order they arrive;
# swap in a real column (e.g. timestamp) if you need a specific order.
w = Window.partitionBy('ID').orderBy(lit("1"))

# lead("point", 1, 0) pulls the next row's point into each row;
# the default 0 fills in for the last row of the partition.
df.withColumn("point1", lead("point", 1, 0).over(w)).show()
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#|               ID|            point|          timestamp|      lon|      lat|           point1|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
#|679ac975acc4bdec9|POINT (-73.267631|2020-01-01 17:10:49|-73.26763|40.850548|POINT (-73.271446|
#|679ac975acc4bdec9|POINT (-73.271446|2020-01-01 02:12:31|-73.27145| 40.85318|POINT (-73.265991|
#|679ac975acc4bdec9|POINT (-73.265991|2020-01-01 17:10:40|-73.26599|40.851482|                0|
#+-----------------+-----------------+-------------------+---------+---------+-----------------+
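The same transform can also be written in Spark SQL; a sketch, assuming the frame has been registered as a temp view named points (note that ORDER BY timestamp makes the "next" row follow event time, unlike the arrival-order version above):

df.createOrReplaceTempView("points")
spark.sql("""
    SELECT *,
           LEAD(point, 1, 0) OVER (PARTITION BY ID ORDER BY timestamp) AS point1
    FROM points
""").show()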