Kafka Connect HDFS - Protobuf to Parquet

Time: 2019-02-17 09:07:05

Tags: hadoop apache-kafka hdfs protocol-buffers parquet

I am trying to use kafka-connect-hdfs, but it does not seem to work.

I have tried playing with the settings, but nothing seems to help.

Here is the Protobuf message schema:

syntax = "proto3";
package com.company;
option java_package = "com.company";
option java_outer_classname = "MyObjectData";
import public "wrappers.proto";
message MyObject {
  int64 site_id = 1;
  string time_zone = 2;
  uint64 dev_id = 3;
  uint64 rep_id = 4;
  uint64 dev_sn = 5;
  UInt64Value timestamp = 6;
  UInt32Value secs = 7;
  UInt64Value man_id = 8;
  FloatValue panv = 9;
  FloatValue outputv = 10;
  FloatValue panelc = 11;
  FloatValue ereset = 12;
  FloatValue temp = 13;
  FloatValue tempin = 14;
  FloatValue tempout = 15;
  UInt32Value sectelem = 16;
  FloatValue energytelem = 17;
  UInt32Value ecode = 18;
}

The connect-standalone.properties is as follows:

bootstrap.servers=k1:9092,k2:9092,k3:9092


key.converter=org.apache.kafka.connect.storage.StringConverter

value.converter=com.blueapron.connect.protobuf.ProtobufConverter
value.converter.protoClassName=com.company.MyObjectData$MyObject
key.converter.schemas.enable=false
value.converter.schemas.enable=true

offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000

plugin.path=/usr/share/java

And quickstart-hdfs.properties is:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=ObjectTopic
hadoop.conf.dir=/etc/hadoop
hdfs.url=hdfs://hdp-01:8020/user/hdfs/telems
hadoop.home=/etc/hadoop/client
flush.size=3
key.converter=org.apache.kafka.connect.storage.StringConverter

value.converter=com.blueapron.connect.protobuf.ProtobufConverter
value.converter.protoClassName=com.company.MyObjectData$MyObject

format.class=io.confluent.connect.hdfs.parquet.ParquetFormat

transforms=SetSchemaName
transforms.SetSchemaName.type=org.apache.kafka.connect.transforms.SetSchemaMetadata$Value
transforms.SetSchemaName.schema.name=com.acme.avro.MyObject

Currently I am getting the following error:

  

org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:586)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.SchemaParseException: Can't redefine: io.confluent.connect.avro.ConnectDefault
    at org.apache.avro.Schema$Names.put(Schema.java:1128)
    at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
    at org.apache.avro.Schema.toString(Schema.java:324)
    at org.apache.avro.Schema.toString(Schema.java:314)
    at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:133)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:270)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:222)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:188)
    at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:131)
    at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:106)
    at io.confluent.connect.hdfs.parquet.ParquetRecordWriterProvider$1.write(ParquetRecordWriterProvider.java:75)
    at io.confluent.connect.hdfs.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:643)
    at io.confluent.connect.hdfs.TopicPartitionWriter.write(TopicPartitionWriter.java:379)
    at io.confluent.connect.hdfs.DataWriter.write(DataWriter.java:375)
    at io.confluent.connect.hdfs.HdfsSinkTask.put(HdfsSinkTask.java:109)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:564)

Also, in case it matters, I am running as the hdfs user.

Is this a schema issue? Nothing I change even seems to alter the error message...

1 Answer:

Answer 0 (score: 0):

Can't redefine: io.confluent.connect.avro.ConnectDefault is probably happening because your transform is setting the schema name property.
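If you want to see whether the transform is involved (just an experiment, not a confirmed fix), you could comment it out in quickstart-hdfs.properties and check whether the error changes:

# transform disabled for testing
#transforms=SetSchemaName
#transforms.SetSchemaName.type=org.apache.kafka.connect.transforms.SetSchemaMetadata$Value
#transforms.SetSchemaName.schema.name=com.acme.avro.MyObject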

You could also try AvroFormat, which will use Connect's internal Schema and Struct objects and write Avro files to HDFS.
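Assuming the rest of the connector config stays as above, that would be a one-line change (AvroFormat ships with the kafka-connect-hdfs plugin):

format.class=io.confluent.connect.hdfs.avro.AvroFormat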

Note that ParquetFormat uses the parquet-avro project, so the data probably needs to start out as Avro.

Notice the stack trace:

  

org.apache.avro.SchemaParseException ...

...

  

    at org.apache.avro.Schema$Names.put(Schema.java:1128)
    at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
    at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
    at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
    at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
    at org.apache.avro.Schema.toString(Schema.java:324)
    at org.apache.avro.Schema.toString(Schema.java:314)
    at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:133)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:270)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:222)
    at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:188)
    at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:131)
    at org.apache.parquet.avro.AvroParquetWriter.<init>(AvroParquetWriter.java:106)

So you will need a Protobuf-to-Avro converter somewhere, perhaps using skeuomorph, in one of these places:

  1. A Kafka Streams job, or a similar process, between the producer and Connect (the simplest of these options; a sketch follows this list)
  2. Modify the kafka-connect-hdfs project so that it can handle Protobuf
  3. Modify the ProtobufConverter code so that it produces Avro data for the ConnectRecord
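For option 1, a rough sketch of what such a bridge job might look like is below. The ObjectTopicAvro topic name is made up for the example, and the actual Protobuf-to-Avro field mapping is left as a placeholder you would still have to implement (for example with generated Avro classes or a skeuomorph-derived schema):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class ProtoToAvroBridge {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "proto-to-avro-bridge");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "k1:9092,k2:9092,k3:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Read the raw Protobuf payloads from the producer's topic, convert each
        // one to Avro, and write the result to a new topic that the HDFS sink
        // (then configured with an Avro converter) would consume instead.
        builder.stream("ObjectTopic", Consumed.with(Serdes.String(), Serdes.ByteArray()))
               .mapValues(ProtoToAvroBridge::protoToAvro)
               .to("ObjectTopicAvro", Produced.with(Serdes.String(), Serdes.ByteArray()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder: parse MyObject from the Protobuf bytes and re-serialize it as Avro.
    private static byte[] protoToAvro(byte[] protobufPayload) {
        throw new UnsupportedOperationException("Protobuf -> Avro mapping goes here");
    }
}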

If all else fails, you could file an issue there and see where that gets you.