I have a Spark Streaming application that receives data from Flume and, after some transformations, writes it to HBase.
But to perform those transformations I need to query some data from a Hive table, and that is where the problem starts.
I cannot use sqlContext or hiveContext inside the transformations (they are not serializable), and when I put the query outside the transformations it only runs once.
How can I make this code run on every streaming batch?
def TB_PARAMETRIZACAO_TGC(sqlContext: HiveContext): Map[String, (String, String)] = {
  // Read the parameter table from Hive and collect it to the driver as a lookup map: TGC -> (TIPO, DESCRICAO)
  val df_consulta = sqlContext.sql("SELECT TGC, TIPO, DESCRICAO FROM dl_prepago.TB_PARAMETRIZACAO_TGC")
  val resultado = df_consulta.map(x =>
      x(Consulta_TB_PARAMETRIZACAO_TGC.TGC.id).toString ->
        (x(Consulta_TB_PARAMETRIZACAO_TGC.TIPO.id).toString,
         x(Consulta_TB_PARAMETRIZACAO_TGC.DESCRICAO.id).toString))
    .collectAsMap()
  resultado
}
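What I think I need is something along these lines, but I am not sure it is the right approach (this is just a sketch of the intent, assuming the input DStream is called stream and the HiveContext is available on the driver):

stream.foreachRDD { rdd =>
  // Runs on the driver once per micro-batch, so the HiveContext can be used here
  val parametrizacao = TB_PARAMETRIZACAO_TGC(hiveContext)
  val parametrizacaoBc = rdd.sparkContext.broadcast(parametrizacao)

  rdd.foreachPartition { partition =>
    // Executor side: only the broadcast map is referenced, never the HiveContext
    val lookup = parametrizacaoBc.value
    partition.foreach { record =>
      // ... apply the transformations using `lookup` and write the result to HBase ...
    }
  }
}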
Answer 0 (score: 0)
Try the very simple approach below. Note that the static JOIN tables can be cached, and they should not be too large; otherwise the static side needs to become a KV-store lookup (LKP), e.g. against HBase (see the sketch after the example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object StreamJoinStatic {

  case class Sales(
      transactionId: String,
      customerId: String,
      itemId: String,
      amountPaid: Double)

  case class Customer(customerId: String, customerName: String)

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local") // Not recommended for real deployments
      .appName("exampleStaticJoinStrStr")
      .getOrCreate()

    import sparkSession.implicits._

    // Create a stream from a socket source
    val socketStreamDf = sparkSession.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .load()

    // Take the customer data as a static Dataset from wherever it lives
    val customerDs = sparkSession.read
      .format("csv")
      .option("header", true)
      .load("src/main/resources/customers.csv")
      .as[Customer]

    // Parse the incoming lines into Sales records
    val dataDf = socketStreamDf.as[String].flatMap(value => value.split(" "))
    val salesDs = dataDf
      .as[String]
      .map { value =>
        val values = value.split(",")
        Sales(values(0), values(1), values(2), values(3).toDouble)
      }

    // Stream-static join on the customerId key
    val joinedDs = salesDs.join(customerDs, "customerId")

    val query = joinedDs.writeStream.format("console").outputMode(OutputMode.Append())
    query.start().awaitTermination()
  }
}
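If the lookup table is too large to cache or broadcast, the static side can instead become a per-partition key-value lookup. Here is a minimal sketch using the standard HBase client API; keysRdd, the table name and the column family/qualifier are illustrative assumptions:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Enrich each key by looking it up in HBase, opening one connection per partition
val enriched = keysRdd.mapPartitions { partition =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("parametrizacao_tgc")) // assumed table name
  val results = partition.map { key =>
    val row = table.get(new Get(Bytes.toBytes(key)))
    val tipo = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("TIPO"))) // assumed family/qualifier
    key -> tipo
  }.toList // materialize before the connection is closed
  table.close()
  connection.close()
  results.iterator
}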
Then adapt this to your specific situation.
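Applied to the original question, the stream-static join could look roughly like this (a sketch only, assuming a Hive-enabled SparkSession and a streaming Dataset eventsDs that carries a TGC column; the names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("streamWithHiveLookup")
  .enableHiveSupport() // gives access to the Hive metastore
  .getOrCreate()

// Static side: the Hive parameter table, cached because it is small
val parametrizacaoDf = spark
  .table("dl_prepago.TB_PARAMETRIZACAO_TGC")
  .select("TGC", "TIPO", "DESCRICAO")
  .cache()

// Stream-static join keyed on TGC; the enriched stream can then be written out to HBase
val enrichedDs = eventsDs.join(parametrizacaoDf, "TGC")

If the Hive table can change between micro-batches, dropping the .cache() should let each micro-batch re-read the static side instead of reusing a frozen snapshot.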