Question

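I am using foreachBatch in PySpark Structured Streaming to write each micro-batch to SQL Server over JDBC. I need to run the same process for several tables, and I would like to reuse one writer function by adding an extra parameter for the table name, but I am not sure how to pass the table name through.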

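The example here is helpful, but in the Python example the table name is hardcoded, and the Scala example appears to reference a global variable (?). I would like to pass the table name into the function instead.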

The function given in the Python example at the link above is:

def writeToSQLWarehose(df, epochId):
  df.write \
    .format("com.databricks.spark.sqldw") \
    .mode('overwrite') \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forward_spark_azure_storage_credentials", "true") \
    .option("dbtable", "my_table_in_dw_copy") \
    .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()

I would like to use something like this:

def writeToSQLWarehose(df, epochId, tableName):
  df.write \
    .format("com.databricks.spark.sqldw") \
    .mode('overwrite') \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forward_spark_azure_storage_credentials", "true") \
    .option("dbtable", tableName) \
    .option("tempdir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()

But I am not sure how to pass the additional argument through foreachBatch.

Answer 1

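Something like this should work: wrap the call in a lambda that takes the two arguments foreachBatch supplies (the batch DataFrame and the epoch id) and forwards tableName as the third.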

streamingDF.writeStream.foreachBatch(lambda df, epochId: writeToSQLWarehose(df, epochId, tableName)).start()
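
If the writers for several tables are built in a loop, the lambda approach has a pitfall worth knowing about: a lambda looks up the table name when it is called, not when it is created, so every lambda built in the loop can end up seeing the last name. A minimal sketch of one way around this uses functools.partial, which binds the name immediately. The function write_to_table and the table names below are hypothetical stand-ins; the real body would be the df.write chain from the question.

```python
from functools import partial


def write_to_table(df, epoch_id, table_name):
    # Hypothetical stand-in for the writer in the question. In a real job the
    # body would be the df.write ... .option("dbtable", table_name) ... .save()
    # chain; here it only reports which table the batch would go to.
    return f"batch {epoch_id} -> {table_name}"


# Hypothetical table names, for illustration only.
table_names = ["orders", "customers"]

# partial binds table_name at creation time, so each writer keeps its own
# table and still accepts the two positional arguments (df, epoch_id).
writers = {name: partial(write_to_table, table_name=name) for name in table_names}

# Lambdas created in a loop look up `name` only when they are called, so
# all of them end up using the last value of the loop variable.
late_bound = [
    lambda df, epoch_id: write_to_table(df, epoch_id, name)
    for name in table_names
]
```

Assuming foreachBatch accepts any callable taking (df, epochId), each stream could then be started with something like streamingDF.writeStream.foreachBatch(writers["orders"]).start(). For a single table whose name is fixed before the stream starts, the plain lambda above is enough.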
