I have data like this:
from pyspark.sql.types import StructType, StructField, StringType

# Every column is read as a string; counts are cast to integer later.
PeopleCountTestSchema = StructType([StructField("building", StringType(), True),
    StructField("date_created", StringType(), True),
    StructField("hour", StringType(), True),
    StructField("wirelesscount", StringType(), True),
    StructField("rundate", StringType(), True)])
df = spark.read.csv("wasb://reftest@refdev.blob.core.windows.net/Praneeth/HVAC/PeopleCount_test/", schema=PeopleCountTestSchema, sep=",")
df.createOrReplaceTempView('Test')
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |null |
|36 |2017-01-10 |0 |null |
|36 |2017-01-11 |0 |null |
|36 |2017-01-12 |0 |null |
|36 |2017-01-13 |0 |null |
This needs to be transformed into:
|building|date_created|hour|wirelesscount|
+--------+------------+----+-------------+
|36 |2017-01-02 |0 |35 |
|36 |2017-01-03 |0 |46 |
|36 |2017-01-04 |0 |32 |
|36 |2017-01-05 |0 |90 |
|36 |2017-01-06 |0 |33 |
|36 |2017-01-07 |0 |22 |
|36 |2017-01-08 |0 |11 |
|36 |2017-01-09 |0 |35 |
|36 |2017-01-10 |0 |46 |
|36 |2017-01-11 |0 |32 |
|36 |2017-01-12 |0 |90 |
|36 |2017-01-13 |0 |33 |
Each null value needs to be replaced with the value from 7 rows earlier.
I tried using:
Test2 = df.withColumn("wirelesscount2", last('wirelesscount', True).over(Window.partitionBy('building','hour').orderBy('hour').rowsBetween(-sys.maxsize, -7)))
The resulting output:
|building|date_created|hour|wirelesscount|rundate |wirelesscount2|
+--------+------------+----+-------------+----------+--------------+
|36 |2017-01-02 |0 |35 |2017-04-01|null |
|36 |2017-01-03 |0 |46 |2017-04-01|null |
|36 |2017-01-04 |0 |32 |2017-04-01|null |
|36 |2017-01-05 |0 |90 |2017-04-01|null |
|36 |2017-01-06 |0 |33 |2017-04-01|null |
|36 |2017-01-07 |0 |22 |2017-04-01|null |
|36 |2017-01-08 |0 |11 |2017-04-01|null |
|36 |2017-01-09 |0 |null |2017-04-01|35 |
|36 |2017-01-10 |0 |null |2017-04-01|46 |
|36 |2017-01-11 |0 |null |2017-04-01|32 |
|36 |2017-01-12 |0 |null |2017-04-01|90 |
|36 |2017-01-13 |0 |null |2017-04-01|33 |
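The nulls are filled with the value from 7 rows earlier, but in the new column the first 7 values become null.
Please tell me how to handle this.
Thanks in advance!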
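Answer 0 (score: 0)
You can do this with coalesce, which returns the first non-null value among its arguments: the original wirelesscount is kept where it exists, and wirelesscount2 is used only where wirelesscount is null.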
from pyspark.sql.functions import coalesce

# Cast both columns to integer so they share one type.
Test2 = Test2.withColumn('wirelesscount', Test2.wirelesscount.cast('integer'))
Test2 = Test2.withColumn('wirelesscount2', Test2.wirelesscount2.cast('integer'))
# Keep the original count where present; otherwise fall back to wirelesscount2.
test3 = Test2.withColumn('wirelesscount3', coalesce(Test2.wirelesscount, Test2.wirelesscount2))
test3.show()
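To sanity-check the expected result without a Spark session, the rule "take the current value, or the value 7 rows earlier when it is null" can be sketched in plain Python. This is only an illustration of the logic, not Spark code: the helper names `coalesce` and `fill_from_seventh_previous` are hypothetical, and the list holds the sample `wirelesscount` values from the question, already ordered by `date_created`.

```python
from typing import List, Optional


def coalesce(*candidates: Optional[int]) -> Optional[int]:
    """Return the first argument that is not None, like SQL COALESCE."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def fill_from_seventh_previous(values: List[Optional[int]],
                               offset: int = 7) -> List[Optional[int]]:
    """Replace each None with the value `offset` rows earlier, if there is one."""
    # Build the "lagged" column: row i sees the original value of row i - offset.
    lagged = [values[i - offset] if i >= offset else None
              for i in range(len(values))]
    # Prefer the current value; fall back to the lagged one.
    return [coalesce(current, previous)
            for current, previous in zip(values, lagged)]


# Sample wirelesscount column from the question (2017-01-02 to 2017-01-13).
counts = [35, 46, 32, 90, 33, 22, 11, None, None, None, None, None]
print(fill_from_seventh_previous(counts))
# [35, 46, 32, 90, 33, 22, 11, 35, 46, 32, 90, 33]
```

Because the lagged column is built from the original values, a run of more than 7 consecutive nulls would still leave nulls behind. In PySpark, the lagged column could presumably be produced with `lag('wirelesscount', 7)` over a window partitioned by `building` and `hour` and ordered by `date_created`, then combined with `coalesce` as in the answer above; that variant is an assumption and was not part of the original answer.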