I have multiple Parquet files, one per sensor, each containing time-series data. What I want to do is resample the data and forward-fill it. I know how to do this for a single sensor, but I'm having trouble doing the same thing efficiently across all of them.
Right now I read one file/sensor at a time, resample it, forward-fill it, push it into an array, and finally join all the DataFrames in that array together, which is very slow. I'm wondering whether I could instead read all the data at once and process each sensor as a separate partition (using an RDD rather than a DataFrame) before joining the results; a rough sketch of what I mean is at the end of the question.
The combined data looks like this:
+----------+-------------+----------+-------------+
| epoch| value| timestamp| tag|
+----------+-------------+----------+-------------+
|1493571720| 9.546202E-05|1493571725| SA|
|1493571720| 0.02965982|1493571735| SA|
|1493571720| -0.001335071|1493571745| SB|
|1493571960| 0|1493572005| SB|
|1493571960| 100|1493572005| SB|
|1493571960| 0|1493571985| SC|
|1493571960| 100|1493572005| SC|
|1493572680|-0.0003813824|1493572695| SC|
+----------+-------------+----------+-------------+
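(The epoch column above is just timestamp rounded down to the minute; I add it in the code further down. To get all sensors into one frame like this I assume I can simply read the parent directory, since the files sit in Hive-style tag=... subdirectories:)

# Assumption: the per-sensor files live under Hive-style partition directories
# (hdfs://s-mac/user/waqas/parquet/tag=SA, tag=SB, ...), so reading the parent
# directory returns a single DataFrame with a 'tag' column added by Spark.
all_df = spark.read.parquet('hdfs://s-mac/user/waqas/parquet/')
all_df.printSchema()  # value, timestamp, ... plus the partition column 'tag'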
Here is what I have written so far:
# Read each file one by one, resample, forward-fill and join
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

resample_unit_sec = 60
source_path = 'hdfs://s-mac/user/waqas/parquet/tag={}'
files = ['SA', 'SB', 'SC']
dfs = []

for file in files:
    parq_df = spark.read.parquet(source_path.format(file))
    # bucket each timestamp down to the start of its resampling interval
    epoch = (F.col("timestamp").cast("bigint") / resample_unit_sec).cast("bigint") * resample_unit_sec
    with_epoch = parq_df.withColumn("epoch", epoch)
    # build a regular time grid covering this sensor and left-join the data onto it
    min_epoch, max_epoch = with_epoch.select(F.min("epoch"), F.max("epoch")).first()
    ref = spark.range(min_epoch, max_epoch + 1, resample_unit_sec).toDF("epoch")
    final = (ref.join(with_epoch, "epoch", "left").orderBy("epoch")
                .withColumn("TS_resampled", F.col("epoch").cast("timestamp")))
    dfs.append(final.cache())

for df in dfs:
    # forward-fill: carry the last non-null value forward over the gaps
    window_ff = Window.orderBy('TS_resampled').rowsBetween(-sys.maxsize, 0)
    ffilled_column = F.last(df['value'], ignorenulls=True).over(window_ff)
    filled_final = df.withColumn('ffilled', ffilled_column)
    filled_final = (filled_final.groupBy(['TS_resampled']).pivot('tag')
                                .agg({'ffilled': 'mean'}).orderBy('TS_resampled'))
    # each per-tag frame is then joined with the others on TS_resampled,
    # which is the part that is painfully slow
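The merge I mentioned at the top looks roughly like this; pivoted_dfs is just an illustrative name for the list I collect the per-tag frames into (each one has TS_resampled plus a single pivoted value column), and this chained outer join is the slow part:

# Illustrative only: 'pivoted_dfs' stands for the list of per-tag frames
# produced by the loop above; the chained outer join on TS_resampled is slow.
from functools import reduce

merged = reduce(
    lambda left, right: left.join(right, on='TS_resampled', how='outer'),
    pivoted_dfs,
).orderBy('TS_resampled')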
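And this is roughly the single-pass version I have in mind, i.e. what I mean by processing each sensor as its own partition, sketched here with a DataFrame window partitioned by tag rather than dropping to an RDD (names mirror the snippet above; I have not verified that this is correct or any faster, which is essentially my question):

# Unverified sketch: resample and forward-fill all tags in one pass.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

resample_unit_sec = 60
# assumes Hive-style tag=... partition directories, as above
all_df = spark.read.parquet('hdfs://s-mac/user/waqas/parquet/')

epoch = (F.col('timestamp').cast('bigint') / resample_unit_sec).cast('bigint') * resample_unit_sec
with_epoch = all_df.withColumn('epoch', epoch)

# one shared time grid, crossed with the distinct tags so every (epoch, tag) slot exists
min_epoch, max_epoch = with_epoch.select(F.min('epoch'), F.max('epoch')).first()
ref = spark.range(min_epoch, max_epoch + 1, resample_unit_sec).toDF('epoch')
grid = ref.crossJoin(with_epoch.select('tag').distinct())

resampled = (grid.join(with_epoch, ['epoch', 'tag'], 'left')
                 .withColumn('TS_resampled', F.col('epoch').cast('timestamp')))

# forward-fill per tag, so each sensor only carries its own last value forward
window_ff = (Window.partitionBy('tag').orderBy('TS_resampled')
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow))
filled = resampled.withColumn('ffilled', F.last('value', ignorenulls=True).over(window_ff))

# one column per tag, as in the per-file version
result = (filled.groupBy('TS_resampled').pivot('tag')
                .agg(F.mean('ffilled')).orderBy('TS_resampled'))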