I have multiple Parquet files, one per sensor, each containing time-series data. What I want to do is resample the data and forward-fill it. I know how to do this for a single sensor, but I'm having trouble doing the same thing efficiently across all of them.
Right now I read one file/sensor at a time, resample it, forward-fill it, push it into an array, and finally join all the DataFrames in that array together, which is very slow. I'm wondering whether I could instead read all the data at once and process each sensor as a separate partition (using an RDD rather than a DataFrame) before joining the results; a rough sketch of what I mean is at the end of the question.
The combined data looks like this:
+----------+-------------+----------+-------------+
| epoch| value| timestamp| tag|
+----------+-------------+----------+-------------+
|1493571720| 9.546202E-05|1493571725| SA|
|1493571720| 0.02965982|1493571735| SA|
|1493571720| -0.001335071|1493571745| SB|
|1493571960| 0|1493572005| SB|
|1493571960| 100|1493572005| SB|
|1493571960| 0|1493571985| SC|
|1493571960| 100|1493572005| SC|
|1493572680|-0.0003813824|1493572695| SC|
+----------+-------------+----------+-------------+
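(The epoch column above is just timestamp rounded down to the minute; I add it in the code further down. To get all sensors into one frame like this I assume I can simply read the parent directory, since the files sit in Hive-style tag=... subdirectories:)

# Assumption: the per-sensor files live under Hive-style partition directories
# (hdfs://s-mac/user/waqas/parquet/tag=SA, tag=SB, ...), so reading the parent
# directory returns a single DataFrame with a 'tag' column added by Spark.
all_df = spark.read.parquet('hdfs://s-mac/user/waqas/parquet/')
all_df.printSchema()  # value, timestamp, ... plus the partition column 'tag'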
Here is what I have written so far:
# Read each file one by one, resample, forward-fill and join
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

resample_unit_sec = 60
source_path = 'hdfs://s-mac/user/waqas/parquet/tag={}'
files = ['SA', 'SB', 'SC']
dfs = []

for file in files:
    parq_df = spark.read.parquet(source_path.format(file))
    # bucket each timestamp down to the start of its resampling interval
    epoch = (F.col("timestamp").cast("bigint") / resample_unit_sec).cast("bigint") * resample_unit_sec
    with_epoch = parq_df.withColumn("epoch", epoch)
    # build a regular time grid covering this sensor and left-join the data onto it
    min_epoch, max_epoch = with_epoch.select(F.min("epoch"), F.max("epoch")).first()
    ref = spark.range(min_epoch, max_epoch + 1, resample_unit_sec).toDF("epoch")
    final = (ref.join(with_epoch, "epoch", "left").orderBy("epoch")
                .withColumn("TS_resampled", F.col("epoch").cast("timestamp")))
    dfs.append(final.cache())

for df in dfs:
    # forward-fill: carry the last non-null value forward over the gaps
    window_ff = Window.orderBy('TS_resampled').rowsBetween(-sys.maxsize, 0)
    ffilled_column = F.last(df['value'], ignorenulls=True).over(window_ff)
    filled_final = df.withColumn('ffilled', ffilled_column)
    filled_final = (filled_final.groupBy(['TS_resampled']).pivot('tag')
                                .agg({'ffilled': 'mean'}).orderBy('TS_resampled'))
    # each per-tag frame is then joined with the others on TS_resampled,
    # which is the part that is painfully slow
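The merge I mentioned at the top looks roughly like this; pivoted_dfs is just an illustrative name for the list I collect the per-tag frames into (each one has TS_resampled plus a single pivoted value column), and this chained outer join is the slow part:

# Illustrative only: 'pivoted_dfs' stands for the list of per-tag frames
# produced by the loop above; the chained outer join on TS_resampled is slow.
from functools import reduce

merged = reduce(
    lambda left, right: left.join(right, on='TS_resampled', how='outer'),
    pivoted_dfs,
).orderBy('TS_resampled')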
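And this is roughly the single-pass version I have in mind, i.e. what I mean by processing each sensor as its own partition, sketched here with a DataFrame window partitioned by tag rather than dropping to an RDD (names mirror the snippet above; I have not verified that this is correct or any faster, which is essentially my question):

# Unverified sketch: resample and forward-fill all tags in one pass.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

resample_unit_sec = 60
# assumes Hive-style tag=... partition directories, as above
all_df = spark.read.parquet('hdfs://s-mac/user/waqas/parquet/')

epoch = (F.col('timestamp').cast('bigint') / resample_unit_sec).cast('bigint') * resample_unit_sec
with_epoch = all_df.withColumn('epoch', epoch)

# one shared time grid, crossed with the distinct tags so every (epoch, tag) slot exists
min_epoch, max_epoch = with_epoch.select(F.min('epoch'), F.max('epoch')).first()
ref = spark.range(min_epoch, max_epoch + 1, resample_unit_sec).toDF('epoch')
grid = ref.crossJoin(with_epoch.select('tag').distinct())

resampled = (grid.join(with_epoch, ['epoch', 'tag'], 'left')
                 .withColumn('TS_resampled', F.col('epoch').cast('timestamp')))

# forward-fill per tag, so each sensor only carries its own last value forward
window_ff = (Window.partitionBy('tag').orderBy('TS_resampled')
                   .rowsBetween(Window.unboundedPreceding, Window.currentRow))
filled = resampled.withColumn('ffilled', F.last('value', ignorenulls=True).over(window_ff))

# one column per tag, as in the per-file version
result = (filled.groupBy('TS_resampled').pivot('tag')
                .agg(F.mean('ffilled')).orderBy('TS_resampled'))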