I have a pipeline that runs Spark jobs on Dataproc clusters in parallel, one per region. For each region it creates a cluster, runs the Spark job, and deletes the cluster once the job finishes.
The Spark job uses the org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset method with a BigQuery configuration to save data to BigQuery tables. The job writes to several tables, calling saveAsNewAPIHadoopDataset multiple times per job.
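For context, the write-side configuration looks roughly like this. It is a minimal Java sketch, assuming the connector's direct BigQueryOutputFormat and its BigQueryConfiguration.configureBigQueryOutput helper; the createOutputConf method and its placeholder arguments are illustrative, not the actual job code:
import org.apache.hadoop.conf.Configuration;
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat;

// Builds the Hadoop Configuration handed to saveAsNewAPIHadoopDataset.
// Project, dataset, table and schema are placeholders, not the real job's values.
Configuration createOutputConf(String projectId, String datasetId, String tableId, String schemaJson) {
    Configuration conf = new Configuration();
    // Direct write path: BigQueryOutputCommitter.setupJob creates a temporary
    // "<dataset>_hadoop_temporary_job_<jobId>" dataset, which is where the 409 below occurs.
    BigQueryConfiguration.configureBigQueryOutput(conf, projectId, datasetId, tableId, schemaJson);
    conf.set("mapreduce.job.outputformat.class", BigQueryOutputFormat.class.getName());
    return conf;
}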
The problem is that the jobs sometimes fail because of a conflict on the Hadoop temporary BigQuery dataset that the connector creates internally:
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 409 Conflict
{
"code" : 409,
"errors" : [ {
"domain" : "global",
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013",
"reason" : "duplicate"
} ],
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.setupJob(BigQueryOutputCommitter.java:107)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:819)
...
The timestamp 201802250620_0013 in the exception above has the _0013 suffix, and I am not sure whether it represents a time.
My thinking is that jobs sometimes run at the same moment and try to create datasets with the same timestamp in their names, either across parallel jobs or within the same job on another saveAsNewAPIHadoopDataset call.
How can I avoid this error without delaying the execution of the jobs?
The dependency I am using is:
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>bigquery-connector</artifactId>
    <version>0.10.2-hadoop2</version>
    <scope>provided</scope>
</dependency>
The Dataproc image version is 1.1.
Edit 1:
I tried using IndirectBigQueryOutputFormat, but now I get an error saying that the GCS output path already exists, even though each saveAsNewAPIHadoopDataset call is given a different path.
Here is my code:
SparkConf sc = new SparkConf().setAppName("MyApp");
try (JavaSparkContext jsc = new JavaSparkContext(sc)) {
    JavaPairRDD<String, String> filesJson = jsc.wholeTextFiles(jsonFolder, parts);
    JavaPairRDD<String, String> jsons = filesJson.flatMapToPair(new FileSplitter()).repartition(parts);
    JavaPairRDD<Object, JsonObject> objsJson = jsons.flatMapToPair(new JsonParser()).filter(t -> t._2() != null).cache();

    objsJson
        .filter(new FilterType(MSG_TYPE1))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE1", "gs://my-bucket/tmp1"));

    objsJson
        .filter(new FilterType(MSG_TYPE2))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE2", "gs://my-bucket/tmp2"));

    objsJson
        .filter(new FilterType(MSG_TYPE3))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE3", "gs://my-bucket/tmp3"));

    // here goes another ingestion process. same code as above but different params, parsers, etc.
}

Configuration createConf(String table, String outGCS) {
    Configuration conf = new Configuration();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}
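One workaround I may try (a minimal sketch; the createUniqueConf helper and the UUID suffix are assumptions, not code I currently run) is to stage every call under a path that is unique per invocation, so a directory left over from a previous run or a parallel job cannot already exist:
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration;
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat;

// Same as createConf above, but appends a random suffix so each invocation
// writes to a fresh GCS staging directory.
Configuration createUniqueConf(String table, String outGCSPrefix) {
    Configuration conf = new Configuration();
    String outGCS = outGCSPrefix + "/" + UUID.randomUUID();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}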
Answer 0 (score: 0)
I believe what may be happening is that each mapper tries to create its own dataset. This is rather inefficient (and burns through the daily quota in proportion to the number of mappers).
An alternative is to use IndirectBigQueryOutputFormat as the output class:
IndirectBigQueryOutputFormat works by first buffering all the data into a temporary Cloud Storage location and then, on commitJob, copying it all from Cloud Storage into BigQuery in one go. Its use is recommended for large jobs, since it only needs one BigQuery "load" job per Hadoop/Spark job, whereas BigQueryOutputFormat performs one BigQuery job per Hadoop/Spark task.
See the example here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
Answer 1 (score: 0)
I tried merging your code with the tutorial's, writing to the tables in the same way, and I made a few changes:
- classOf[TextOutputFormat[_,_]] instead of TextOutputFormat.class
- conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_,_]].getName) instead of conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
It seems to work fine for me. I did hit your same error when I cancelled a job mid-execution and re-ran it. I'm attaching my full code below (it could be improved, since I repeat the same block three times instead of using a function), but I hope it helps.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
// Marked as transient since configuration is not Serializable. This should
// only be necessary in spark-shell REPL.
@transient
val conf = sc.hadoopConfiguration
// Input parameters.
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val projectId = "PROJECT_ID"
val bucket = "BUCKET_NAME"
// Input configuration.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
// Helper to convert JsonObjects to (word, count) tuples.
def convertToTuple(record: JsonObject) : (String, Long) = {
val word = record.get("word").getAsString.toLowerCase
val count = record.get("word_count").getAsLong
return (word, count)
}
// Helper to convert (word, count) tuples to JsonObjects.
def convertToJson(pair: (String, Long)) : JsonObject = {
val word = pair._1
val count = pair._2
val jsonObject = new JsonObject()
jsonObject.addProperty("word", word)
jsonObject.addProperty("word_count", count)
return jsonObject
}
// Load data from BigQuery.
val tableData = sc.newAPIHadoopRDD(
conf,
classOf[GsonBigQueryInputFormat],
classOf[LongWritable],
classOf[JsonObject])
// Perform word count.
val wordCounts = (tableData
.map(entry => convertToTuple(entry._2))
.reduceByKey(_ + _))
// Display 10 results.
wordCounts.take(10).foreach(l => println(l))
// Write data back into a new BigQuery table.
// IndirectBigQueryOutputFormat discards keys, so set key to null.
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple1",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple1",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
// Truncate the table before writing output to allow multiple runs.
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
"WRITE_TRUNCATE")
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple2",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple2",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple3",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple3",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))