BigQuery-Spark connector:

Date: 2019-05-31 10:12:31

Tags: apache-spark apache-spark-sql google-bigquery

I am running into a problem when writing data to BigQuery with the spark-bigquery connector. If the data is read from a file, the connector writes it to the BigQuery table without issue. However, when the data is read from a Cassandra table, it throws the error below. I checked the type after reading from the file and after reading from Cassandra; both are correctly of type spark.sql.DataFrame.
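Since the eventual error complains that the table's schema has no fields, a useful first check is that the Cassandra-sourced DataFrame actually carries a non-empty schema before the write. A minimal sketch, assuming the Spark Cassandra connector is on the classpath; the keyspace and table names (`shop`, `orders`) are placeholders:

```scala
// Debugging sketch: confirm the DataFrame read from Cassandra has columns.
// "shop" / "orders" are hypothetical names; substitute your own.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "orders"))
  .load()

df.printSchema()  // should list the Cassandra table's columns
require(df.schema.fields.nonEmpty, "DataFrame schema is empty")
```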

19/05/31 10:02:32 INFO com.google.cloud.hadoop.io.bigquery.BigQueryHelper: No import schema provided, auto detecting schema.
19/05/31 10:02:39 ERROR org.apache.spark.internal.io.SparkHadoopWriter: Aborting job job_20190531100218_0006.
java.io.IOException: Error during BigQuery job execution: {"location":"query","message":"Schema has no fields. Table: orders_output_e4c96db3_d224_46ca_aef7_5b3fd0f19162_source","reason":"invalidQuery"}
        at com.google.cloud.hadoop.io.bigquery.BigQueryUtils.waitForJobCompletion(BigQueryUtils.java:108)
        at com.google.cloud.hadoop.io.bigquery.BigQueryHelper.importFromGcs(BigQueryHelper.java:234)
        at com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputCommitter.commitJob(IndirectBigQueryOutputCommitter.java:73)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:94)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081)
        at com.hm.CassandraBigquery$.main(CassandraBigquery.scala:41)
        at com.hm.CassandraBigquery.main(CassandraBigquery.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Below is the BigQuery configuration, based on the documentation linked here:
BigQueryOutputConfiguration.configureWithAutoSchema(
  conf,
  outputTableId,
  outputGcsPath,
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])

conf.set("mapreduce.job.outputformat.class",
  classOf[IndirectBigQueryOutputFormat[_, _]].getName)

conf.set(BigQueryConfiguration.OUTPUT_TABLE_WRITE_DISPOSITION_KEY,
  "WRITE_APPEND")

Please help. Thanks in advance.

1 answer:

Answer 0 (score: 0)

Judging from the log line "No import schema provided, auto detecting schema" followed by the error "Schema has no fields", it appears BigQuery was unable to detect the schema. The documentation notes that "BigQuery makes a best-effort" at auto-detection, so detection can fail.

In that case, setting the schema manually should work. You can use the method configure, passing one of the following:

  • a BigQueryTableSchema
  • a JSON-formatted string containing the schema
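A minimal sketch of the explicit-schema variant, following the pattern shown in the connector's documentation: replace `configureWithAutoSchema` with `configure` and pass a `BigQueryTableSchema`. The field names and types (`order_id`, `amount`) are placeholders for your actual columns:

```scala
// Sketch: declare the output schema explicitly instead of relying on
// auto-detection. Field names/types here are hypothetical examples.
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat
import com.google.cloud.hadoop.io.bigquery.output.{
  BigQueryOutputConfiguration, BigQueryTableFieldSchema, BigQueryTableSchema}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import scala.collection.JavaConverters._

val outputSchema = new BigQueryTableSchema().setFields(
  List(
    new BigQueryTableFieldSchema().setName("order_id").setType("STRING"),
    new BigQueryTableFieldSchema().setName("amount").setType("FLOAT")
  ).asJava)

BigQueryOutputConfiguration.configure(
  conf,
  outputTableId,
  outputSchema,
  outputGcsPath,
  BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
  classOf[TextOutputFormat[_, _]])
```

With the schema supplied up front, the load job no longer depends on BigQuery inferring fields from the staged JSON files.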

In the second link you will see that "avro stores the schema in the file", which is most likely why writing Avro files to GCS works.
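If you want to try that Avro route instead, a rough sketch (assuming the spark-avro package is on the classpath; the bucket path is a placeholder) is to stage the DataFrame as Avro, which embeds the schema in the files themselves:

```scala
// Hypothetical sketch: stage the DataFrame to GCS as Avro, so the schema
// travels with the files. "gs://my-bucket/tmp/orders_avro" is a placeholder.
df.write
  .format("avro")  // requires the spark-avro package
  .save("gs://my-bucket/tmp/orders_avro")
```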