I am writing data to Hive using a Spark SQL Dataset. It works fine as long as the schema stays the same, but if I change the Avro schema, e.g. by adding a new column in the middle, it fails with the error below (the schema is provided by a schema registry):
Error running job streaming job 1519289340000 ms.0
org.apache.spark.sql.AnalysisException: The column number of the existing table default.sample(struct<collection_timestamp:bigint,managed_object_id:string,managed_object_type:string,if_admin_status:string,date:string,hour:int,quarter:bigint>) doesn't match the data schema(struct<collection_timestamp:bigint,managed_object_id:string,if_oper_status:string,managed_object_type:string,if_admin_status:string,date:string,hour:int,quarter:bigint>);
if_oper_status is the new column that needs to be added. Please advise.
// Build a StructType from the latest Avro schema in the registry
StructType struct = convertSchemaToStructType(SchemaRegstryClient.getLatestSchema("simple"));
Dataset<Row> dataset = getSparkInstance().createDataFrame(newRDD, struct);

// Add the partition columns derived from the current date/time
dataset = dataset.withColumn("date", functions.date_format(functions.current_date(), "dd-MM-yyyy"));
dataset = dataset.withColumn("hour", functions.hour(functions.current_timestamp()));
dataset = dataset.withColumn("quarter", functions.floor(functions.minute(functions.current_timestamp()).divide(5)));

// Append a single file per partition to the Hive table
dataset
    .coalesce(1)
    .write().mode(SaveMode.Append)
    .option("charset", "UTF8")
    .partitionBy("date", "hour", "quarter")
    .option("checkpointLocation", "/tmp/checkpoint")
    .saveAsTable("sample");
Answer 0 (score: 1)
I was able to solve this by saving the schema from the registry to a file and pointing avro.schema.url at that file path, as shown below.
Note: the CREATE EXTERNAL TABLE statement must be run before saveAsTable("sample").
dataset.sqlContext().sql(
    "CREATE EXTERNAL TABLE IF NOT EXISTS sample " +
    "PARTITIONED BY (dt STRING, hour STRING, quarter STRING) " +
    "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' " +
    "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' " +
    "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' " +
    "LOCATION 'hdfs://localhost:9000/user/root/sample' " +
    "TBLPROPERTIES ('avro.schema.url'='file://" + file.getAbsolutePath() + "')");
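For reference, here is a minimal sketch of how the schema could be dumped from the registry to a local file before creating the table. SchemaRegstryClient.getLatestSchema is the helper assumed from the question's code, and the file path is just an example, not a required location:

// Assumed helper from the question: returns an org.apache.avro.Schema.
// Schema#toString() yields the schema's JSON representation.
String schemaJson = SchemaRegstryClient.getLatestSchema("simple").toString();

// Write the JSON to a local file; this path is only an example.
java.io.File file = new java.io.File("/tmp/sample-latest.avsc");
java.nio.file.Files.write(
    file.toPath(),
    schemaJson.getBytes(java.nio.charset.StandardCharsets.UTF_8));

// 'file' can now be referenced via 'avro.schema.url' in the
// CREATE EXTERNAL TABLE statement above.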
Answer 1 (score: 0)
See this link: https://github.com/databricks/spark-avro/pull/155. According to the commit history, the PR adding support for evolving Avro schemas went into version 3.1. Which spark-avro version are you using in your code?