I have multiple DataFrames extracted from a single JSON message in Azure Event Hub. We want to push these DataFrames to separate tables in Synapse DW using a Spark Streaming job.
Here is my schema:
root
|-- Name: string (nullable = true)
|-- Salary: string (nullable = true)
|-- EmpID: string (nullable = true)
|-- Projects: struct (nullable = true)
| |-- ProjectID: string (nullable = true)
| |-- ProjectName: string (nullable = true)
| |-- Duration: string (nullable = true)
| |-- Location: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- City: string (nullable = true)
| | | |-- State: string (nullable = true)
| |-- Contact: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Phone: string (nullable = true)
| | | |-- email: string (nullable = true)
I extracted 4 different DataFrames from the schema above.
Each of them should be inserted into its own table in Synapse (4 tables in total):
ProjectDf.write.format("spark.sqldw").options(.dbo.Project).save(...)
LocationDf.write.format("spark.sqldw").options(.dbo.Loc).save(...)
ContactDf.write.format("spark.sqldw").options(.dbo.Contact).save(...)
EmployeeDf.write.format("spark.sqldw").options(.dbo.Emp).save(...)
Please suggest how to apply a foreachBatch sink here to insert into the tables.
Answer 0 (score: 1)
If you plan to write four different DataFrames derived from a single input streaming DataFrame, you can use foreachBatch as follows:
import org.apache.spark.sql.DataFrame

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // batchDF feeds several outputs, so it may be worth persisting it
  batchDF.persist()
  // create the four different DataFrames based on the input
  val ProjectDf = batchDF.select(...)
  val LocationDf = batchDF.select(...)
  val ContactDf = batchDF.select(...)
  val EmployeeDf = batchDF.select(...)
  // then save those four DataFrames into the desired locations
  ProjectDf.write.format("spark.sqldw").options(.dbo.Project).save(...)
  LocationDf.write.format("spark.sqldw").options(.dbo.Loc).save(...)
  ContactDf.write.format("spark.sqldw").options(.dbo.Contact).save(...)
  EmployeeDf.write.format("spark.sqldw").options(.dbo.Emp).save(...)
  // do not forget to unpersist batchDF once all writes are done
  batchDF.unpersist()
}
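To make the select(...) and save(...) placeholders above more concrete, here is a minimal sketch based on the schema in the question. Everything specific in it is an assumption rather than something stated in the question: the object and helper names (EmployeeBatch, writeToSynapse) are hypothetical, EmpID and ProjectID are carried along as join keys by choice, the connection values are placeholders, and the format name "com.databricks.spark.sqldw" with the url, tempDir, forwardSparkAzureStorageCredentials and dbTable options assumes the Azure Synapse connector that ships with Databricks.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

object EmployeeBatch {
  // Column choices are assumptions read off the schema in the question.
  def employee(batchDF: DataFrame): DataFrame =
    batchDF.select("EmpID", "Name", "Salary")

  def project(batchDF: DataFrame): DataFrame =
    batchDF.select(
      col("EmpID"),
      col("Projects.ProjectID"),
      col("Projects.ProjectName"),
      col("Projects.Duration"))

  // explode emits one row per array element and drops rows
  // whose array is null or empty.
  def location(batchDF: DataFrame): DataFrame =
    batchDF
      .select(col("Projects.ProjectID"), explode(col("Projects.Location")).as("loc"))
      .select(col("ProjectID"), col("loc.City"), col("loc.State"))

  def contact(batchDF: DataFrame): DataFrame =
    batchDF
      .select(col("Projects.ProjectID"), explode(col("Projects.Contact")).as("c"))
      .select(col("ProjectID"), col("c.Phone"), col("c.email"))

  // Placeholder connection settings (assumed); replace with real values.
  val synapseOptions: Map[String, String] = Map(
    "url" -> "jdbc:sqlserver://<server>;database=<db>",
    "tempDir" -> "abfss://<container>@<account>.dfs.core.windows.net/<dir>",
    "forwardSparkAzureStorageCredentials" -> "true")

  // Appends one DataFrame to one Synapse table, e.g. "dbo.Project".
  def writeToSynapse(df: DataFrame, table: String): Unit =
    df.write
      .format("com.databricks.spark.sqldw")
      .options(synapseOptions)
      .option("dbTable", table)
      .mode("append")
      .save()
}
```

Inside foreachBatch, a call such as EmployeeBatch.writeToSynapse(EmployeeBatch.location(batchDF), "dbo.Loc") would then take the place of the corresponding select and save lines. If rows with empty Location or Contact arrays must still appear in the output, explode_outer may be a better fit than explode.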
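This is described in the documentation under Using foreach and foreachBatch.
If you get the compile error "overloaded method foreachBatch with alternatives", have a look at the release notes of Databricks Runtime 7.0, which say: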
"To fix the compilation error, change foreachBatch { (df, id) => myFunc(df, id) } to foreachBatch(myFunc _) or use the Java API explicitly: foreachBatch(new VoidFunction2 ...)."
This means your code would look like this:
import org.apache.spark.sql.DataFrame

def myFunc(batchDF: DataFrame, batchId: Long): Unit = {
  // batchDF feeds several outputs, so it may be worth persisting it
  batchDF.persist()
  // create the four different DataFrames based on the input
  val ProjectDf = batchDF.select(...)
  val LocationDf = batchDF.select(...)
  val ContactDf = batchDF.select(...)
  val EmployeeDf = batchDF.select(...)
  // then save those four DataFrames into the desired locations
  ProjectDf.write.format("spark.sqldw").options(.dbo.Project).save(...)
  LocationDf.write.format("spark.sqldw").options(.dbo.Loc).save(...)
  ContactDf.write.format("spark.sqldw").options(.dbo.Contact).save(...)
  EmployeeDf.write.format("spark.sqldw").options(.dbo.Emp).save(...)
  // do not forget to unpersist batchDF once all writes are done
  batchDF.unpersist()
}
streamingDF.writeStream.foreachBatch(myFunc _).[...].start()
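The release note also mentions the explicit Java API as an alternative. A minimal sketch of that route, assuming Spark's org.apache.spark.api.java.function.VoidFunction2 interface; the names BatchFunctions, asJava, start and the checkpointDir parameter are hypothetical and only serve the illustration:

```scala
import org.apache.spark.api.java.function.VoidFunction2
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.streaming.StreamingQuery

object BatchFunctions {
  // Wraps a Scala (DataFrame, Long) => Unit in the Java functional interface,
  // so only the Java overload of foreachBatch matches and the call is unambiguous.
  def asJava(f: (DataFrame, Long) => Unit): VoidFunction2[Dataset[Row], java.lang.Long] =
    new VoidFunction2[Dataset[Row], java.lang.Long] {
      override def call(batchDF: Dataset[Row], batchId: java.lang.Long): Unit =
        f(batchDF, batchId.longValue())
    }

  // checkpointDir is an assumed, caller-supplied path where Spark keeps
  // the query's progress information between restarts.
  def start(streamingDF: DataFrame, checkpointDir: String)(
      f: (DataFrame, Long) => Unit): StreamingQuery =
    streamingDF.writeStream
      .option("checkpointLocation", checkpointDir)
      .foreachBatch(asJava(f))
      .start()
}
```

With myFunc defined as above, this could be used as BatchFunctions.start(streamingDF, "/tmp/checkpoints/emp")(myFunc), where the checkpoint path is again only an example.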