Question

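I'm looking for an open-source connector that can stream data from Spark Streaming to Google BigQuery. Is there one?

I found one from Spotify, but it is not actively maintained and only allows sending records in Avro format.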

Answer 1

I needed this too and couldn't find anything, so I added google-cloud-bigquery directly as a dependency and wrote:

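import scala.collection.JavaConverters._
import com.google.cloud.bigquery.Table
import com.google.cloud.bigquery.InsertAllRequest.RowToInsert
import org.apache.spark.streaming.dstream.DStream

implicit class RichDStreamMyClass(dstream: DStream[MyClass]) {
  /** Writes the [[DStream]] of [[MyClass]] records to BigQuery.
    * All records of a partition are inserted in a single request,
    * i.e. one insert per partition per batch window.
    */
  def saveToBigQuery(tableRef: Table): Unit =
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Materialize the iterator so it can be checked and sent as one list.
        val rowsToInsert = partition.map(toRowToInsert).toList.asJava
        if (!rowsToInsert.isEmpty) {
          val insertResponse = tableRef.insert(rowsToInsert)
          // `logger` is assumed to be in scope (any logging library will do).
          if (insertResponse.hasErrors)
            logger.error(s"${insertResponse.getInsertErrors.values()}")
        }
      }
    }
}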

/** Creates a [[RowToInsert]] for BigQuery by mapping the fields of a
  * [[MyClass]]. */
def toRowToInsert(myClass: MyClass): RowToInsert = {
  // Keys must match the column names of the target table.
  val fields = Map(
    "timestamp" -> myClass.timestamp,
    "name" -> myClass.name
  ).asJava
  // The first argument is the row's insert id.
  RowToInsert.of(s"${myClass.key}", fields)
}

Note that the insert method cannot insert more than 10k elements at once, so I also have something like this:

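import org.apache.spark.SparkConf

// maxRatePerPartition is in records per second per Kafka partition, so
// dividing the 10k limit by the batch window (assumed here to be in seconds)
// keeps each partition at roughly 10k records per batch at most.
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition",
    (10000 / config.spark.window).toString)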

tableRef is an instance of com.google.cloud.bigquery.Table.
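If capping the Kafka read rate is not an option (for example, with a non-Kafka source), another way to stay under the per-request row limit mentioned above is to split each partition into chunks before inserting. The following is only a minimal sketch: `BatchedInsert`, `insertInBatches`, and `maxRows` are hypothetical names introduced here for illustration, not part of google-cloud-bigquery or Spark, and the insert call itself is passed in as a function so the helper stays independent of both libraries.

```scala
// Hypothetical helper (not part of any library): splits an iterator into
// chunks of at most `maxRows` elements and hands each chunk to `insert`.
object BatchedInsert {
  /** Calls `insert` once per chunk and returns the number of chunks sent. */
  def insertInBatches[A](rows: Iterator[A], maxRows: Int)(insert: Seq[A] => Unit): Int = {
    require(maxRows > 0, "maxRows must be positive")
    var batches = 0
    // Iterator.grouped yields consecutive chunks; the last one may be smaller.
    rows.grouped(maxRows).foreach { batch =>
      insert(batch)
      batches += 1
    }
    batches
  }
}
```

Inside `foreachPartition`, this could presumably be used as `BatchedInsert.insertInBatches(partition.map(toRowToInsert), 10000) { batch => tableRef.insert(batch.asJava) }`, with the error check from the answer applied to each response. The value 10000 is taken from the limit stated in the answer; check the current BigQuery quotas before relying on it.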
