I have a pipeline that runs Spark jobs on Dataproc clusters in parallel, one per region. For each region it creates a cluster, runs the Spark job, and deletes the cluster once the job finishes.
The Spark job uses the org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset method with a BigQuery configuration to save data to BigQuery tables. The job writes to several tables, calling saveAsNewAPIHadoopDataset multiple times per job.
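For context, the write-side configuration looks roughly like this. It is a minimal Java sketch, assuming the connector's direct BigQueryOutputFormat and its BigQueryConfiguration.configureBigQueryOutput helper; the createOutputConf method and its placeholder arguments are illustrative, not the actual job code:
import org.apache.hadoop.conf.Configuration;
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat;

// Builds the Hadoop Configuration handed to saveAsNewAPIHadoopDataset.
// Project, dataset, table and schema are placeholders, not the real job's values.
Configuration createOutputConf(String projectId, String datasetId, String tableId, String schemaJson) {
    Configuration conf = new Configuration();
    // Direct write path: BigQueryOutputCommitter.setupJob creates a temporary
    // "<dataset>_hadoop_temporary_job_<jobId>" dataset, which is where the 409 below occurs.
    BigQueryConfiguration.configureBigQueryOutput(conf, projectId, datasetId, tableId, schemaJson);
    conf.set("mapreduce.job.outputformat.class", BigQueryOutputFormat.class.getName());
    return conf;
}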
The problem is that the jobs sometimes fail because of a conflict on the Hadoop temporary BigQuery dataset that the connector creates internally:
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 409 Conflict
{
"code" : 409,
"errors" : [ {
"domain" : "global",
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013",
"reason" : "duplicate"
} ],
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.setupJob(BigQueryOutputCommitter.java:107)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:819)
...
The timestamp 201802250620_0013 in the exception above has the _0013 suffix, and I am not sure whether it represents a time.
My thinking is that jobs sometimes run at the same moment and try to create datasets with the same timestamp in their names, either across parallel jobs or within the same job on another saveAsNewAPIHadoopDataset call.
How can I avoid this error without delaying the execution of the jobs?
The dependency I am using is:
<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>bigquery-connector</artifactId>
    <version>0.10.2-hadoop2</version>
    <scope>provided</scope>
</dependency>
The Dataproc image version is 1.1.
Edit 1:
I tried using IndirectBigQueryOutputFormat, but now I get an error saying that the GCS output path already exists, even though each saveAsNewAPIHadoopDataset call is given a different path.
Here is my code:
SparkConf sc = new SparkConf().setAppName("MyApp");
try (JavaSparkContext jsc = new JavaSparkContext(sc)) {
    JavaPairRDD<String, String> filesJson = jsc.wholeTextFiles(jsonFolder, parts);
    JavaPairRDD<String, String> jsons = filesJson.flatMapToPair(new FileSplitter()).repartition(parts);
    JavaPairRDD<Object, JsonObject> objsJson = jsons.flatMapToPair(new JsonParser()).filter(t -> t._2() != null).cache();

    objsJson
        .filter(new FilterType(MSG_TYPE1))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE1", "gs://my-bucket/tmp1"));

    objsJson
        .filter(new FilterType(MSG_TYPE2))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE2", "gs://my-bucket/tmp2"));

    objsJson
        .filter(new FilterType(MSG_TYPE3))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE3", "gs://my-bucket/tmp3"));

    // here goes another ingestion process. same code as above but different params, parsers, etc.
}

Configuration createConf(String table, String outGCS) {
    Configuration conf = new Configuration();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}
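One workaround I may try (a minimal sketch; the createUniqueConf helper and the UUID suffix are assumptions, not code I currently run) is to stage every call under a path that is unique per invocation, so a directory left over from a previous run or a parallel job cannot already exist:
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration;
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat;

// Same as createConf above, but appends a random suffix so each invocation
// writes to a fresh GCS staging directory.
Configuration createUniqueConf(String table, String outGCSPrefix) {
    Configuration conf = new Configuration();
    String outGCS = outGCSPrefix + "/" + UUID.randomUUID();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}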
Answer 0 (score: 0)
I believe what may be happening is that each mapper tries to create its own dataset. This is rather inefficient (and burns through the daily quota in proportion to the number of mappers).
An alternative is to use IndirectBigQueryOutputFormat as the output class:
IndirectBigQueryOutputFormat works by first buffering all the data into a temporary Cloud Storage location and then, on commitJob, copying it all from Cloud Storage into BigQuery in one go. Its use is recommended for large jobs, since it only needs one BigQuery "load" job per Hadoop/Spark job, whereas BigQueryOutputFormat performs one BigQuery job per Hadoop/Spark task.
See the example here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
Answer 1 (score: 0)
I tried merging your code with the tutorial's, writing to the tables in the same way, and I made a few changes:
- classOf[TextOutputFormat[_,_]] instead of TextOutputFormat.class
- conf.set("mapreduce.job.outputformat.class", classOf[IndirectBigQueryOutputFormat[_,_]].getName) instead of conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
It seems to work fine for me. I did hit your same error when I cancelled a job mid-execution and re-ran it. I'm attaching my full code below (it could be improved, since I repeat the same block three times instead of using a function), but I hope it helps.
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
// Marked as transient since configuration is not Serializable. This should
// only be necessary in spark-shell REPL.
@transient
val conf = sc.hadoopConfiguration
// Input parameters.
val fullyQualifiedInputTableId = "publicdata:samples.shakespeare"
val projectId = "PROJECT_ID"
val bucket = "BUCKET_NAME"
// Input configuration.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, bucket)
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
// Helper to convert JsonObjects to (word, count) tuples.
def convertToTuple(record: JsonObject) : (String, Long) = {
val word = record.get("word").getAsString.toLowerCase
val count = record.get("word_count").getAsLong
return (word, count)
}
// Helper to convert (word, count) tuples to JsonObjects.
def convertToJson(pair: (String, Long)) : JsonObject = {
val word = pair._1
val count = pair._2
val jsonObject = new JsonObject()
jsonObject.addProperty("word", word)
jsonObject.addProperty("word_count", count)
return jsonObject
}
// Load data from BigQuery.
val tableData = sc.newAPIHadoopRDD(
conf,
classOf[GsonBigQueryInputFormat],
classOf[LongWritable],
classOf[JsonObject])
// Perform word count.
val wordCounts = (tableData
.map(entry => convertToTuple(entry._2))
.reduceByKey(_ + _))
// Display 10 results.
wordCounts.take(10).foreach(l => println(l))
// Write data back into a new BigQuery table.
// IndirectBigQueryOutputFormat discards keys, so set key to null.
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple1",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple1",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
// Truncate the table before writing output to allow multiple runs.
conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
"WRITE_TRUNCATE")
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple2",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple2",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))
BigQueryOutputConfiguration.configure(
  conf,
  "PROJECT_ID:wordcount_dataset.multiple3",
  null,
  "gs://BUCKET_NAME/hadoop/tmp/bigquery/multiple3",
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_,_]])
conf.set("mapreduce.job.outputformat.class",
classOf[IndirectBigQueryOutputFormat[_,_]].getName)
(wordCounts
.map(pair => (null, convertToJson(pair)))
.saveAsNewAPIHadoopDataset(conf))