Question

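According to the Dataproc docs, it has "native and automatic integrations with BigQuery".

I have a table in BigQuery. I want to read that table and run some analysis on it using the Dataproc cluster I've created (via a PySpark job), then write the results of that analysis back to BigQuery. You may ask "why not just do the analysis in BigQuery directly?" The reason is that we are building complex statistical models, and SQL is too high-level for developing them. We need something like Python or R, ergo Dataproc.

Are there any Dataproc + BigQuery examples available? I can't find any.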

Answer 1

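First, as noted in this question, the BigQuery connector is preinstalled on Cloud Dataproc clusters.

Here is an example of how to read data from BigQuery into Spark. In this example, we read data from BigQuery to perform a word count. You read data from BigQuery in Spark using SparkContext.newAPIHadoopRDD. The Spark documentation has more information on using SparkContext.newAPIHadoopRDD.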

import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.mapred.BigQueryMapredInputFormat
import com.google.gson.JsonObject

import org.apache.hadoop.io.LongWritable

val projectId = "<your-project-id>"
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val fullyQualifiedOutputTableId = "<your-fully-qualified-table-id>"
val outputTableSchema =
    "[{'name': 'Word','type': 'STRING'},{'name': 'Count','type': 'INTEGER'}]"
val jobName = "wordcount"

val conf = sc.hadoopConfiguration

// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)

// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)

// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
BigQueryConfiguration.configureBigQueryOutput(conf,
    fullyQualifiedOutputTableId, outputTableSchema)

val fieldName = "word"

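// Read the input table as an RDD of (LongWritable key, JsonObject row) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
    classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
// Cache the RDD, force the read with count(), then fetch the first ten rows as strings.
tableData.cache()
tableData.count()
tableData.map(entry => (entry._1.toString(),entry._2.toString())).take(10)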

You will need to customize this example with your own settings, including your Cloud Platform project ID in <your-project-id> and your output table ID in <your-fully-qualified-table-id>.
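
Since the question asks about PySpark, here is a rough Python counterpart of the Scala read in this answer. Treat it as a sketch under assumptions rather than a verified recipe: the `mapred.bq.*` property names and the `JsonTextBigQueryInputFormat` class are the ones I recall from Google's Dataproc tutorial for the Hadoop BigQuery connector, so check them against the connector version on your cluster. The helper names (`parse_table_id`, `bigquery_input_conf`, `to_word_count`, `read_table`) and the temporary path under the bucket are made up for illustration.

```python
import json


def parse_table_id(fully_qualified_table_id):
    """Split 'project:dataset.table' into (project, dataset, table)."""
    # rpartition keeps domain-scoped project IDs such as 'example.com:proj' intact.
    project, _, rest = fully_qualified_table_id.rpartition(":")
    dataset, _, table = rest.partition(".")
    if not (project and dataset and table):
        raise ValueError(
            "expected 'project:dataset.table', got %r" % fully_qualified_table_id
        )
    return project, dataset, table


def bigquery_input_conf(project_id, bucket, input_table_id):
    """Build the Hadoop configuration dict for reading one BigQuery table.

    Property names are assumed from the Hadoop BigQuery connector.
    """
    in_project, in_dataset, in_table = parse_table_id(input_table_id)
    return {
        "mapred.bq.project.id": project_id,
        "mapred.bq.gcs.bucket": bucket,
        # Placeholder staging location for the connector's temporary export.
        "mapred.bq.temp.gcs.path": "gs://{}/hadoop/tmp/bigquery/pyspark_input".format(bucket),
        "mapred.bq.input.project.id": in_project,
        "mapred.bq.input.dataset.id": in_dataset,
        "mapred.bq.input.table.id": in_table,
    }


def to_word_count(record):
    """Turn one (key, JSON string) record into a (word, count) pair."""
    _, json_text = record
    row = json.loads(json_text)
    # int() accepts both numeric and string-encoded counts.
    return row["word"], int(row["word_count"])


def read_table(sc, conf):
    """Read the configured table as an RDD of (key, JSON string) pairs."""
    return sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf,
    )
```

On a cluster you would create a `SparkContext`, build the dict with `bigquery_input_conf("<your-project-id>", "<your-bucket>", "publicdata:samples.shakespeare")`, and then chain something like `read_table(sc, conf).map(to_word_count).reduceByKey(lambda a, b: a + b)`. Keeping the configuration and parsing in plain functions makes them easy to check without a cluster.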

Finally, if you end up using the BigQuery connector with MapReduce, this page has examples of how to write MapReduce jobs with the BigQuery connector.

Answer 2

You can also use the spark-bigquery connector (https://github.com/samelamin/spark-bigquery) to run queries directly on Dataproc using Spark.

Answer 3

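The example above doesn't show how to write data to the output table. You need to do this:

// `wordCounts` is a placeholder name for your RDD[(String, JsonObject)] of output rows;
// `hadoopConf` is the Hadoop configuration (called `conf` in Answer 1).
wordCounts.saveAsNewAPIHadoopFile(
    hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY),
    classOf[String],
    classOf[JsonObject],
    classOf[BigQueryOutputFormat[String, JsonObject]],
    hadoopConf)

where the key (of type String) is actually ignored.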
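
For a PySpark job, one alternative is to skip `BigQueryOutputFormat` altogether: stage the results in Cloud Storage as newline-delimited JSON and load them with the `bq` command-line tool. This is a different technique from the Scala call above, sketched under assumptions: the helper names `to_json_line` and `bq_load_command` are hypothetical, the field names follow the Word/Count schema from Answer 1, and it presumes `bq` is available on the node running the driver.

```python
import json


def to_json_line(pair):
    """Serialize one (word, count) pair as a JSON row for the output table."""
    word, count = pair
    return json.dumps({"Word": word, "Count": count})


def bq_load_command(dataset, table, gcs_files, replace=False):
    """Build the argument list for loading staged JSON files with `bq load`."""
    command = ["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON"]
    if replace:
        # --replace overwrites the destination table instead of appending to it.
        command.append("--replace")
    command += [
        "{}.{}".format(dataset, table),
        gcs_files,
        # Inline schema matching the Word/Count table from Answer 1.
        "Word:STRING,Count:INTEGER",
    ]
    return command
```

In the job itself this would look roughly like `word_counts.map(to_json_line).saveAsTextFile(output_dir)` followed by `subprocess.check_call(bq_load_command("<dataset>", "<table>", output_dir + "/part-*"))`, where `word_counts` and `output_dir` are your own RDD and a `gs://` path.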
