Using Kafka Connect

Date: 2017-09-22 08:51:27

Tags: apache-kafka connect confluent

I am running Kafka Connect in distributed mode. The command is: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties

The worker configuration is:


    bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
    group.id=connect-cluster
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable=false
    value.converter.schemas.enable=false

Kafka Connect restarts without any errors.

The topics connect-configs, connect-offsets, and connect-status have been created, and the topic mysiteview already exists.

Then I create the Kafka connector through the REST API, as follows:


    curl -X POST -H "Content-Type: application/json" --data '{"name":"hdfs-sink-mysiteview","config":{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","tasks.max":"3","topics":"mysiteview","hdfs.url":"hdfs://master1:8020","topics.dir":"/kafka/topics","logs.dir":"/kafka/logs","format.class":"io.confluent.connect.hdfs.avro.AvroFormat","flush.size":"1000","rotate.interval.ms":"1000","partitioner.class":"io.confluent.connect.hdfs.partitioner.DailyPartitioner","path.format":"YYYY-MM-dd","schema.compatibility":"BACKWARD","locale":"zh_CN","timezone":"Asia/Shanghai"}}'  http://kafka1:8083/connectors

When I produce data to the topic "mysiteview", for example:


    {"f1":"192.168.1.1","f2":"aa.example.com"}

The Java code is as follows:

    import java.util.Date;
    import java.util.Properties;
    import java.util.Random;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // The snippet below runs inside a method that declares `throws InterruptedException`.
    // JSON.toJSONString comes from a JSON serialization library (e.g. fastjson); User is a
    // simple POJO with f1/f2 fields, and `events` is the number of records to send.
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka1:9092");
    props.put("acks", "all");
    props.put("retries", 3);
    props.put("batch.size", 16384);
    props.put("linger.ms", 30);
    props.put("buffer.memory", 33554432);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, String> producer = new KafkaProducer<String, String>(props);
    Random rnd = new Random();
    for (long nEvents = 0; nEvents < events; nEvents++) {
        long runtime = new Date().getTime();  // currently unused
        String site = "www.example.com";
        String ipString = "192.168.2." + rnd.nextInt(255);
        String key = "" + rnd.nextInt(255);   // currently unused; records are sent without a key
        User u = new User();
        u.setF1(ipString);
        u.setF2(site + " " + rnd.nextInt(255));
        System.out.println(JSON.toJSONString(u));
        producer.send(new ProducerRecord<String, String>("mysiteview", JSON.toJSONString(u)));
        Thread.sleep(50);
    }

    producer.flush();
    producer.close();


Something strange happens: the data shows up in kafka-logs, but there is no data in HDFS (not even a topic directory). I try the connector command:


    curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/status

The output is:

    {"name":"hdfs-sink-mysiteview","connector":{"state":"RUNNING","worker_id":"10.255.223.178:8083"},"tasks":[{"state":"RUNNING","id":0,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":1,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":2,"worker_id":"10.255.223.178:8083"}]}

But when I check the task status with the following command:

    curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/hdfs-sink-siteview-1

I get the result: "Error 404". All three tasks return the same error!

What is going wrong?

2 answers:

Answer 0 (score: 0)

Without seeing the worker's logs, I'm not sure exactly how your HDFS connector instances are failing with the settings you describe above. However, I can spot a few issues in the configuration:

  1. You mention that you start your Connect worker with: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties. These properties default the key and value converters to AvroConverter and require a running schema-registry service. If you have indeed edited the configuration in connect-avro-distributed.properties to use the JsonConverter instead, your HDFS connector will probably fail during the conversion of Kafka records to Connect's SinkRecord data type, just before it tries to export the data to HDFS.
  2. Until recently, the HDFS connector could only export Avro records, to files in Avro or Parquet format. That requires using the AvroConverter mentioned above. The ability to export records as JSON in text files was added only recently and will appear in version 4.0.0 of the connector (you can try it by checking out and building the connector from source).
  3. At this point, my first suggestion would be to try importing your data with bin/kafka-avro-console-producer. Define a schema for it, confirm that the data is imported successfully with bin/kafka-avro-console-consumer, and then set the HDFS connector to use AvroFormat as above. The quickstart on the connector's page describes a very similar process and might be a good starting point for your use case; a rough sketch of these commands follows this list.
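As a hedged sketch of point 3: the value schema below is only an assumption based on the f1/f2 fields in the question, the Schema Registry URL is assumed to be http://localhost:8081, and the exact CLI flags vary between Confluent versions.

    # Produce a few Avro records into the topic (schema and registry URL are assumptions).
    bin/kafka-avro-console-producer --broker-list kafka1:9092 --topic mysiteview \
      --property schema.registry.url=http://localhost:8081 \
      --property value.schema='{"type":"record","name":"SiteView","fields":[{"name":"f1","type":"string"},{"name":"f2","type":"string"}]}'

    # Read the records back to confirm they were written as Avro.
    # Older Confluent releases take --zookeeper here instead of --bootstrap-server.
    bin/kafka-avro-console-consumer --bootstrap-server kafka1:9092 --topic mysiteview --from-beginning \
      --property schema.registry.url=http://localhost:8081

With Avro data in the topic, the worker would keep the connect-avro-distributed.properties defaults, i.e. key.converter and value.converter set to io.confluent.connect.avro.AvroConverter with key.converter.schema.registry.url and value.converter.schema.registry.url pointing at the Schema Registry, rather than the JsonConverter settings shown in the question.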

Answer 1 (score: 0)

Maybe you are just using the REST API incorrectly. According to the documentation, the call should be /connectors/:connector_name/tasks/:task_id

https://docs.confluent.io/3.3.1/connect/restapi.html#get--connectors-(string-name)-tasks-(int-taskid)-status
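For the connector in this question, the per-task status call would look something like the following (connector name and worker host taken from the question; task IDs run from 0 to tasks.max - 1):

    curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/tasks/0/status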