I am using Spark version 2.3.2.
I have written code in Spark Structured Streaming to insert streaming DataFrame data into two different MySQL tables.
Suppose there are two streaming DataFrames: DF1 and DF2.
Using the ForeachWriter API I have written two queries (query1 and query2) that write to the MySQL tables from the two streams, i.e. DF1 goes into MySQL table A and DF2 goes into MySQL table B.
When I run the Spark job, it runs query1 first and then query2, so it writes to table A but not to table B.
If I change the code to run query2 first and then query1, it writes to table B but not to table A.
So I understand that only whichever query is started first ever executes and writes to its table.
Note: I have tried giving the two tables separate MySQL users/databases, but with no luck.
Can anyone advise how to make this work?
My code is below:
import java.sql._
import org.apache.spark.sql.ForeachWriter

// Writes each row of streamDF1 into MySQL table A
class JDBCSink1(url: String, user: String, pwd: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    val insertSql = """ INSERT INTO tableA(col1,col2,col3) VALUES(?,?,?); """
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.setString(3, value(2).toString)
    preparedStmt.execute
    preparedStmt.close()   // avoid leaking statements on a long-running stream
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}
// Writes each row of streamDF2 into MySQL table B
class JDBCSink2(url: String, user: String, pwd: String) extends ForeachWriter[org.apache.spark.sql.Row] {
  val driver = "com.mysql.jdbc.Driver"
  var connection: Connection = _
  var statement: Statement = _

  def open(partitionId: Long, version: Long): Boolean = {
    Class.forName(driver)
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement
    true
  }

  def process(value: org.apache.spark.sql.Row): Unit = {
    val insertSql = """ INSERT INTO tableB(col1,col2) VALUES(?,?); """
    val preparedStmt: PreparedStatement = connection.prepareStatement(insertSql)
    preparedStmt.setString(1, value(0).toString)
    preparedStmt.setString(2, value(1).toString)
    preparedStmt.execute
    preparedStmt.close()   // avoid leaking statements on a long-running stream
  }

  def close(errorOrNull: Throwable): Unit = {
    connection.close
  }
}
val url1 = "jdbc:mysql://hostname:3306/db1"
val url2 = "jdbc:mysql://hostname:3306/db2"
val user1 = "usr1"
val user2 = "usr2"
val pwd = "password"

val Writer1 = new JDBCSink1(url1, user1, pwd)
val Writer2 = new JDBCSink2(url2, user2, pwd)

val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start().awaitTermination()

val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start().awaitTermination()
Answer (score: 3)
You are blocking the second query because of the awaitTermination. If you want to have two output streams, you need to start both of them before waiting for their termination:
val query2 =
  streamDF2
    .writeStream
    .foreach(Writer2)
    .outputMode("append")
    .trigger(ProcessingTime("35 seconds"))
    .start()

val query1 =
  streamDF1
    .writeStream
    .foreach(Writer1)
    .outputMode("append")
    .trigger(ProcessingTime("30 seconds"))
    .start()

query1.awaitTermination()
query2.awaitTermination()
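If you would rather block until any one of the running queries stops (for example on failure), the StreamingQueryManager exposes awaitAnyTermination. A minimal sketch, assuming spark is your active SparkSession and both queries have already been started as shown above:

// Blocks until any active streaming query in this session terminates or fails
spark.streams.awaitAnyTermination()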
Edit:
Spark also lets you schedule and allocate resources to the different streaming queries, as described in Scheduling within an application. You can configure the pools to use either FIFO or FAIR scheduling. The pool configuration is set up by creating an XML file, similar to conf/fairscheduler.xml.template, and then either placing a file named fairscheduler.xml on the classpath or setting the spark.scheduler.allocation.file property in your SparkConf:
conf.set("spark.scheduler.allocation.file", "/path/to/file")
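For reference, a pool definition modeled on the conf/fairscheduler.xml.template that ships with Spark might look roughly like this (the pool names pool1 and pool2 are just placeholders that match the snippet below):

<?xml version="1.0"?>
<allocations>
  <pool name="pool1">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
  <pool name="pool2">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>1</minShare>
  </pool>
</allocations>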
The different pools can then be applied per query, for example:
// Set the pool as a thread-local property right before starting each query,
// so the jobs of each stream are submitted to the corresponding pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
val query1 = streamDF1.writeStream.[...].start()

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
val query2 = streamDF2.writeStream.[...].start()
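Note that for the pools to be scheduled fairly at all, the scheduler mode itself has to be switched from the default FIFO to FAIR when the SparkConf/SparkSession is configured, e.g.:

conf.set("spark.scheduler.mode", "FAIR")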