How to write two streaming DataFrames to two different MySQL tables in Spark Structured Streaming?

Asked: 2020-10-07 15:30:43

Tags: apache-spark spark-structured-streaming

I am using Spark version 2.3.2.

I have written code in Spark Structured Streaming to insert streaming DataFrame data into two different MySQL tables.

Let's say there are two streaming DataFrames: DF1 and DF2.

I have written two queries (query1 and query2) using the ForeachWriter API to write to the MySQL tables from the two different streams, i.e. DF1 goes to MySQL table A and DF2 goes to MySQL table B.

When I run the Spark job, it runs query1 first and then query2, so it writes to table A but not to table B.

If I change the code to run query2 first and then query1, it writes to table B but not to table A.

So I understand that only the first query ever executes and writes to its table.

Note: I have tried giving the two tables different MySQL users/databases, but with no luck.

Can anyone suggest how to make this work?

My code is below:

import java.sql._
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// Sink for streamDF1: opens one JDBC connection per partition and
// inserts each incoming row into tableA.
class JDBCSink1(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _

  // Called once per partition per trigger; returning true tells Spark
  // to go ahead and process this partition's rows.
  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    true
  }

  // Called once per row; the prepared statement is closed after each
  // insert so statements do not leak on a long-running stream.
  def process(value: Row): Unit = {
    val insertSql = "INSERT INTO tableA(col1,col2,col3) VALUES(?,?,?)"
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.setString(3, value(2).toString)
    preparedStmt.execute()
    preparedStmt.close()
  }

  def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}



// Sink for streamDF2: identical pattern, but inserts into tableB.
class JDBCSink2(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    true
  }

  def process(value: Row): Unit = {
    val insertSql = "INSERT INTO tableB(col1,col2) VALUES(?,?)"
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.execute()
    preparedStmt.close()
  }

  def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}
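
As an aside, the two sinks differ only in their INSERT statement, so a single parameterized writer could serve both tables. A minimal sketch (the JDBCSink name and the positional string-binding loop are illustrative assumptions, not part of the original code):

// Hypothetical generalization: one writer parameterized by the INSERT
// statement; every column of the row is bound positionally as a string,
// matching what the two original sinks do by hand.
class JDBCSink(url: String, user: String, pwd: String, insertSql: String)
    extends ForeachWriter[Row] {
  var connection: Connection = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName("com.mysql.jdbc.Driver")
    connection = DriverManager.getConnection(url, user, pwd)
    true
  }

  def process(value: Row): Unit = {
    val stmt = connection.prepareStatement(insertSql)
    for (i <- 0 until value.length) stmt.setString(i + 1, value(i).toString)
    stmt.execute()
    stmt.close()
  }

  def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}

// Usage, e.g.:
// val writerA = new JDBCSink(url1, user1, pwd, "INSERT INTO tableA(col1,col2,col3) VALUES(?,?,?)")
// val writerB = new JDBCSink(url2, user2, pwd, "INSERT INTO tableB(col1,col2) VALUES(?,?)")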



val url1="jdbc:mysql://hostname:3306/db1"
val url2="jdbc:mysql://hostname:3306/db2"

val user1 ="usr1"
val user2="usr2"
val pwd = "password"

val Writer1 = new JDBCSink1(url1,user1, pwd)

val Writer2 = new JDBCSink2(url2,user2, pwd)


val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start().awaitTermination()



val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start().awaitTermination()

1 Answer:

Answer 0 (score: 3)

You are blocking the second query because of the awaitTermination. If you want to have two output streams, you need to start both of them before waiting for their termination:

val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start()

val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start()

query1.awaitTermination()
query2.awaitTermination()
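
If you simply want the driver to stay alive until any of the running queries stops or fails, you can also block on the StreamingQueryManager instead of on each query individually. A minimal sketch, assuming your SparkSession is available as spark:

// Start both queries first (as above), then block until any active query
// terminates; if one failed, this rethrows its StreamingQueryException.
spark.streams.awaitAnyTermination()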

EDIT:

Spark also gives you the possibility to schedule and allocate resources to the different streaming queries, as described in Scheduling within an application. You can configure the pools based on:

  • schedulingMode: can be FIFO or FAIR
  • weight: "This controls the pool's share of the cluster relative to other pools. By default, all pools have a weight of 1. If you give a specific pool a weight of 2, for example, it will get 2x more resources as other active pools."
  • minShare: "Apart from an overall weight, each pool can be given a minimum shares (as a number of CPU cores) that the administrator would like it to have."

The pool configuration can be set by creating an XML file, similar to conf/fairscheduler.xml.template, and either placing a file named fairscheduler.xml on the classpath or setting the spark.scheduler.allocation.file property in your SparkConf:

conf.set("spark.scheduler.allocation.file", "/path/to/file")

The different pools can then be applied as follows:

// setLocalProperty applies to the thread it is called on, and a query
// inherits the property in effect when it is started, so set the pool
// immediately before each start() (start() itself takes no pool argument):
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
val query1 = streamDF1.writeStream.[...].start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
val query2 = streamDF2.writeStream.[...].start()