Question

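I have a Spark Structured Streaming job that reads CSV files, runs some computations, and writes text files for a downstream model to consume. The output is a single column made up of the original columns concatenated with spaces. For example: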

1556951121 7.19 26.6 36.144 14.7402 1
1556951122 7.59 27.1 37.697 14.7402 1
1556951123 8.01 27.7 39.328 14.7403 0
etc.

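The downstream model needs some extra header information at the top of each file: the file name on the first line and the number of columns on the second. For example: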

filename
6
1556951121 7.19 26.6 36.144 14.7402 1
1556951122 7.59 27.1 37.697 14.7402 1
1556951123 8.01 27.7 39.328 14.7403 0
etc.
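
To make the target layout concrete, here is a minimal plain-Python sketch (no Spark involved) that builds the text of one file from a name and the already-concatenated rows. The function name `with_header` is hypothetical, and it assumes the column count can be taken from the first row:

```python
def with_header(filename, rows):
    """Return the file text: file name, column count, then the data rows."""
    # Assumption: every row has the same number of space-separated fields,
    # so the first row is enough to determine the column count.
    n_columns = len(rows[0].split()) if rows else 0
    return "\n".join([filename, str(n_columns), *rows]) + "\n"


rows = [
    "1556951121 7.19 26.6 36.144 14.7402 1",
    "1556951122 7.59 27.1 37.697 14.7402 1",
    "1556951123 8.01 27.7 39.328 14.7403 0",
]
print(with_header("filename", rows), end="")
```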

Can this be done in Spark? I created the header information as a separate DataFrame:

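from pyspark.sql.types import StringType, StructField, StructType

# Two single-column rows: the file name and the column count
header = [('filename',), ('6',)]
rdd = sparkSession.sparkContext.parallelize(header)
headerDF = sparkSession.createDataFrame(rdd, schema=StructType([StructField('values', StringType(), False)]))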

I tried union, but a union between a streaming DataFrame and a static DataFrame is not supported.

I also looked at join, but I don't think it does what I need, because it would add a column rather than rows.

For reference, this is the output query:

df.coalesce(1)\
  .writeStream\
  .outputMode("append")\
  .format("text")\
  .option("checkpointLocation", checkpoint_path)\
  .option("path", path)\
  .start()\
  .awaitTermination()

And this is the input source:

df = sparkSession.readStream\
                 .option("header", "true")\
                 .option("maxFilesPerTrigger", 1)\
                 .schema(schema)\
                 .csv(input_path)

The input CSV consists only of a timestamp and some sensor values. For example:

Timestamp,Sensor1,Sensor2,Sensor3,Sensor4,Sensor5
1556951121,7.19,26.6,36.144,14.7402,True
1556951122,7.59,27.1,37.697,14.7402,True
1556951123,8.01,27.7,39.328,14.7403,False
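
For readers who want to see how the sample input relates to the sample output, here is a small plain-Python sketch of the row transformation. The actual computations in the job are not shown in the question, so this only assumes what the samples suggest: the header row is dropped, `True`/`False` become `1`/`0`, and the fields are joined with spaces. The function name `csv_to_lines` is hypothetical.

```python
import csv
import io


def csv_to_lines(csv_text):
    """Turn CSV text with a header row into space-separated output lines."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row, if there is one
    flags = {"True": "1", "False": "0"}  # assumed mapping, inferred from the samples
    return [" ".join(flags.get(value, value) for value in record) for record in reader]


sample = """Timestamp,Sensor1,Sensor2,Sensor3,Sensor4,Sensor5
1556951121,7.19,26.6,36.144,14.7402,True
1556951122,7.59,27.1,37.697,14.7402,True
1556951123,8.01,27.7,39.328,14.7403,False
"""
for line in csv_to_lines(sample):
    print(line)
```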

Answer 1

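In the end I used the foreachBatch sink, because it hands you a static DataFrame for each micro-batch, which you can then join or union with other DataFrames: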

df.coalesce(1).writeStream.foreachBatch(foreach_batch_function).start()

And the foreachBatch function:

import os

def foreach_batch_function(df, epoch_id):
    # df is a static DataFrame holding one micro-batch, so union is allowed here
    complete_df = headerDF.union(df)
    complete_df.coalesce(1).write.text(os.path.join(output_path, str(epoch_id)))
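
A possible variation, sketched under a few assumptions: headerDF above is built once with fixed values, so every batch gets the same header. If the header should differ per batch, it can be built inside the batch function instead. In the sketch below, the helper names `header_rows` and `make_batch_writer`, the `batch_<epoch_id>` naming scheme, and the overwrite mode are all hypothetical choices, not part of the original solution. It also assumes the batch DataFrame has a single string column, as the text sink requires.

```python
import os


def header_rows(filename, n_columns):
    """Build the two header rows as single-column tuples."""
    return [(filename,), (str(n_columns),)]


def make_batch_writer(spark, output_path, n_columns):
    """Return a function suitable for writeStream.foreachBatch.

    `spark` is an existing SparkSession; `n_columns` is the number of
    space-separated fields in each output line (6 in the example data).
    """
    def write_batch(batch_df, epoch_id):
        # Hypothetical naming scheme: one output directory per micro-batch.
        name = f"batch_{epoch_id}"
        header_df = spark.createDataFrame(
            header_rows(name, n_columns), schema="value string"
        )
        # Assumption: overwrite, so a batch replayed after a failure
        # replaces its earlier output instead of raising an error.
        (header_df.union(batch_df)
            .coalesce(1)
            .write.mode("overwrite")
            .text(os.path.join(output_path, name)))

    return write_batch
```

It would be wired in with something like `df.writeStream.foreachBatch(make_batch_writer(sparkSession, output_path, 6)).start()`. As far as I know, Spark does not document a row-order guarantee for a union followed by coalesce(1), so it is worth checking that the header rows really land at the top of the written file.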
