I have a dataframe in pyspark. This is what it looks like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098930| 53 |
|670098934| 55 |
+---------+---------+
I want to fill the gaps in the timestamps with the previous state, so that I get a perfect set to compute a time-weighted average. This is what the output should be:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098929| 50 |
|670098930| 53 |
|670098931| 53 |
|670098932| 53 |
|670098933| 53 |
|670098934| 55 |
+---------+---------+
Ultimately, I want to persist this new dataframe to disk and visualize my analysis.
How can I do this in pyspark? (For simplicity I have kept only 2 columns here; my actual dataframe has 89 columns and ~670 million records before filling the gaps.)
Answer 0 (score: 1)
You can generate the timestamp ranges, flatten them, and select the rows:
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType

# `sc` is the SparkContext available in the pyspark shell
a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]])\
    .toDF(['timestamp', 'price'])

# expand each timestamp into the 5 seconds starting at it; list() is needed
# on Python 3, where range() returns a lazy object rather than a list
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

a.withColumn('timestamp', f(a.timestamp))\
    .withColumn('timestamp', func.explode(func.col('timestamp')))\
    .groupBy('timestamp')\
    .agg(func.max(func.col('price')))\
    .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
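Two caveats on the snippet above: the UDF hard-codes a 5-second expansion (which is why the output runs past 670098934), and resolving overlaps with max(price) is only correct while prices never decrease. Below is a gap-aware sketch that avoids both, assuming Spark 2.4+ (for func.sequence) and reusing the dataframe a from above: look up each row's successor with a window lead and generate exactly the missing seconds, each carrying that row's price.

import pyspark.sql.functions as func
from pyspark.sql import Window

# for the real ~670M-row table, partition this window by a key column
# instead of ordering the whole dataset globally
w = Window.orderBy('timestamp')

filled = (a
    # next observed timestamp; the last row falls back to its own timestamp + 1
    .withColumn('next_ts', func.coalesce(func.lead('timestamp').over(w),
                                         func.col('timestamp') + 1))
    # one row per second in [timestamp, next_ts), carrying this row's price
    .withColumn('timestamp',
                func.explode(func.sequence(func.col('timestamp'),
                                           func.col('next_ts') - 1)))
    .drop('next_ts'))

filled.show()

Once the series is filled at one-second resolution, the time-weighted average reduces to func.avg('price'), and the result can be persisted with the usual writers, e.g. filled.write.parquet('filled_prices') (hypothetical output path).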