Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing

Date: 2019-04-04 13:51:35

Tags: scala apache-kafka spark-streaming greenplum

I am trying to read data from Kafka and load it into a Greenplum database with Spark. I am using the greenplum-spark connector, but I get "Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing." Does the Greenplum source not support streaming data? The website says "Continuous ETL pipeline (streaming)."

I have tried passing both "greenplum" and "io.pivotal.greenplum.spark.GreenplumRelationProvider" as the data source to .format("datasource").

val EventStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", args(0))
  .option("subscribe", args(1))
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load

val gscWriteOptionMap = Map(
  "url" -> "link for greenplum",
  "user" -> "****",
  "password" -> "****",
  "dbschema" -> "dbname"
)
val stateEventDS = EventStream
  .selectExpr("CAST(key AS String)", "*****(value)")
  .as[(String, ******)]
  .map(_._2)

val EventOutputStream = stateEventDS.writeStream
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .options(gscWriteOptionMap)
  .start()

EventOutputStream.awaitTermination()

2 Answers:

Answer 0 (Score: 1):

Which version of GPDB / Spark are you using? You could bypass Spark altogether and use the Greenplum-Kafka connector instead.


In earlier versions, the Greenplum-Spark connector exposed a Spark data source named io.pivotal.greenplum.spark.GreenplumRelationProvider for reading data from a Greenplum database into a Spark DataFrame.

In later versions, the connector exposes a Spark data source named greenplum for transferring data between Spark and a Greenplum database.

It should look something like this:

val EventOutputStream = stateEventDS.write.format("greenplum")
  .options(gscWriteOptionMap)
  .save()
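
Note that stateEventDS in the question is a streaming Dataset, so a plain write will not run as-is. Below is a minimal sketch of one workaround, assuming Spark 2.4+ and that stateEventDS is a Dataset[String] holding the Kafka message values: write each micro-batch through the batch greenplum data source inside foreachBatch. The dbtable value and checkpoint path are assumptions, not part of the original answer.

import org.apache.spark.sql.Dataset

val EventOutputStream = stateEventDS.writeStream
  .foreachBatch { (batchDS: Dataset[String], batchId: Long) =>
    // Reuse the connection options from the question and append each micro-batch.
    batchDS.toDF("value").write
      .format("greenplum")
      .options(gscWriteOptionMap)
      .option("dbtable", "kafka_events")   // assumed target table
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/greenplum-checkpoint")
  .start()

EventOutputStream.awaitTermination()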

See: https://gpdb.docs.pivotal.io/5170/greenplum-kafka/overview.html

Answer 1 (Score: 0):

Greenplum Spark Structured Streaming

Demonstrates how to use the writeStream API against GPDB via JDBC.

The following code blocks read from the rate streaming source and use a JDBC-based sink to stream data into GPDB in batches.

Batch-based streaming

import org.apache.spark.sql.streaming.Trigger

import scala.concurrent.duration._

// Read from the built-in rate source and write each micro-batch through a
// custom JDBC-backed data source registered under the short name "myjdbc"
// (the provider itself is not shown in this answer).
val sq = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("myjdbc").
  option("checkpointLocation", "/tmp/jdbc-checkpoint").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start
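
The "myjdbc" format above refers to a custom data source provider that the answer does not show. As an assumed alternative (not part of the original answer), on Spark 2.4+ the same batch-style behaviour can be sketched with foreachBatch and Spark's built-in JDBC writer; the connection details mirror the record-based example below and the table name is made up.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

import scala.concurrent.duration._

val sq = spark.readStream
  .format("rate")
  .load
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Push the whole micro-batch to GPDB through the generic JDBC writer.
    batchDF.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://gsc-dev:5432/gpadmin")
      .option("dbtable", "rate_events")   // assumed target table
      .option("user", "gpadmin")
      .option("password", "changeme")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/jdbc-checkpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start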

Record-based streaming

This uses a ForeachWriter.

import org.apache.spark.sql.streaming.Trigger

import scala.concurrent.duration._

val url = "jdbc:postgresql://gsc-dev:5432/gpadmin"
val user = "gpadmin"
val pwd = "changeme"

// JDBCSink is a user-defined ForeachWriter[Row] that writes each record to
// GPDB over JDBC; a sketch of such a class follows below.
val jdbcWriter = new JDBCSink(url, user, pwd)

// Attach the ForeachWriter with foreach (format() only accepts a source name).
val sq = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  foreach(jdbcWriter).
  option("checkpointLocation", "/tmp/jdbc-checkpoint").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start
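
The JDBCSink class used above is not shown in the answer. Below is a minimal sketch of what such a ForeachWriter could look like; the target table rate_events and its columns are assumptions, chosen to match the rate source's timestamp and value columns.

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical JDBCSink: one JDBC connection per partition, one INSERT per record.
class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  private var connection: Connection = _
  private var statement: PreparedStatement = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.prepareStatement(
      "INSERT INTO rate_events (event_time, event_value) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    // The rate source produces a "timestamp" and a "value" column.
    statement.setTimestamp(1, row.getAs[java.sql.Timestamp]("timestamp"))
    statement.setLong(2, row.getAs[Long]("value"))
    statement.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (statement != null) statement.close()
    if (connection != null) connection.close()
  }
}

For anything beyond a demo, batching inserts per micro-batch (or going through the Greenplum-Kafka connector or gpfdist) tends to scale much better than row-at-a-time JDBC inserts into GPDB.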