I am trying to read data from Kafka and load it into a Greenplum database using Spark. I am using the greenplum-spark connector, but I keep getting "Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing". Does the greenplum source not support streaming data? The website says "Continuous ETL pipeline (streaming)".
I have tried passing both "greenplum" and "io.pivotal.greenplum.spark.GreenplumRelationProvider" as the data source to .format("datasource").
val EventStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", args(0))
  .option("subscribe", args(1))
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load
val gscWriteOptionMap = Map(
  "url" -> "link for greenplum",
  "user" -> "****",
  "password" -> "****",
  "dbschema" -> "dbname"
)
val stateEventDS = EventStream
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .map(_._2)
val EventOutputStream = stateEventDS.writeStream
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .options(gscWriteOptionMap)
  .start()
EventOutputStream.awaitTermination()
Answer 0 (score: 1)
Which versions of GPDB and Spark are you using? You could also bypass Spark altogether and use the Greenplum-Kafka connector instead.
In earlier versions, the Greenplum-Spark connector exposed a Spark data source named io.pivotal.greenplum.spark.GreenplumRelationProvider for reading data from Greenplum Database into a Spark DataFrame.
In later versions, the connector exposes a Spark data source named greenplum for transferring data between Spark and Greenplum Database.
It should look like this:

val EventOutputStream = stateEventDS.write
  .format("greenplum")
  .options(gscWriteOptionMap)
  .save()
See: https://gpdb.docs.pivotal.io/5170/greenplum-kafka/overview.html
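Note that stateEventDS is a streaming Dataset, so the batch write above cannot be invoked on it directly. On Spark 2.4+, one way to drive the batch-only greenplum source from a stream is foreachBatch. A minimal sketch, assuming a connector version that exposes the greenplum source and that gscWriteOptionMap also carries the required target-table option (the "dbtable" key and the checkpoint path here are assumptions):

import org.apache.spark.sql.Dataset

val EventOutputStream = stateEventDS.writeStream
  .foreachBatch { (batch: Dataset[String], batchId: Long) =>
    // Each micro-batch arrives as a plain Dataset, so the batch writer applies.
    batch.write
      .format("greenplum")
      .options(gscWriteOptionMap) // assumed to include the target table, e.g. "dbtable"
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/greenplum-checkpoint")
  .start()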
Answer 1 (score: 0)
Here is a demonstration of how to use the writeStream API against GPDB over JDBC.

The following code block reads from the rate streaming source and uses a JDBC-based sink to push the stream to GPDB in micro-batches:
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val sq = spark.readStream
  .format("rate")   // built-in test source that generates (timestamp, value) rows
  .load
  .writeStream
  .format("myjdbc") // custom JDBC sink registered under the short name "myjdbc"
  .option("checkpointLocation", "/tmp/jdbc-checkpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start
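"myjdbc" is not a built-in format, so this only runs if a custom sink is registered on the classpath under that short name. A hypothetical sketch of such a provider for Spark 2.x (the class name, the pass-through of options to the built-in jdbc batch writer, and the re-materialization step are all assumptions, not part of the original answer):

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// Hypothetical provider backing format("myjdbc"). For the short name to resolve,
// the class must be listed in
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
class MyJdbcSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def shortName(): String = "myjdbc"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = {
      // Re-materialize the micro-batch so the batch JDBC writer accepts it.
      val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
      // Assumes callers pass the usual JDBC options (url, dbtable, user, password).
      batch.write.format("jdbc").options(parameters).mode("append").save()
    }
  }
}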
This variant uses a ForeachWriter instead:
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
val url = "jdbc:postgresql://gsc-dev:5432/gpadmin"
val user = "gpadmin"
val pwd = "changeme"
val jdbcWriter = new JDBCSink(url, user, pwd)
val sq = spark.readStream
  .format("rate")
  .load
  .writeStream
  .foreach(jdbcWriter) // a ForeachWriter is passed to foreach, not to format
  .option("checkpointLocation", "/tmp/jdbc-checkpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start
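JDBCSink itself is not shown in the answer. A minimal sketch of what it might look like, assuming the rate source's (timestamp, value) schema and a pre-created target table rate_events(ts timestamp, value bigint) in GPDB (both the class body and the table are assumptions):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical JDBCSink: opens one connection per partition and epoch,
// inserts each row, and closes the connection when the epoch ends.
class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  private var connection: Connection = _
  private var statement: PreparedStatement = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = DriverManager.getConnection(url, user, pwd)
    // Assumes rate_events(ts timestamp, value bigint) already exists in GPDB.
    statement = connection.prepareStatement(
      "INSERT INTO rate_events (ts, value) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    // The rate source emits rows of (timestamp: Timestamp, value: Long).
    statement.setTimestamp(1, row.getTimestamp(0))
    statement.setLong(2, row.getLong(1))
    statement.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}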