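Kafka: how to consume data based on timestamp

Time: 2018-05-18 06:57:50

Tags: python apache-kafka

I would like to know whether there is some way, other than offsets, to fetch data by time interval. Say I want to consume all of yesterday's data; how would I do that?

2 answers: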

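Answer 0 (score: 0)

You can look up the earliest offset at the start of the desired time interval and rewind to that offset. It is harder to tell where the interval ends, because a record with an earlier timestamp can arrive later. So you can consume records from the start of the interval until you see a record whose timestamp is later than endTime, and then read some more records to catch late messages.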
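The stopping rule can be sketched independently of any Kafka client. This is only an illustration, not the answerer's code: `records_in_window` and `grace_ms` are hypothetical names, records are assumed to be `(timestamp_ms, value)` pairs in offset order, and the allowance for late messages is expressed here as a time-based grace period rather than a number of extra records.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str]  # (timestamp in ms, payload); hypothetical shape


def records_in_window(records: Iterable[Record], start_ms: int, end_ms: int,
                      grace_ms: int = 0) -> Iterator[Record]:
    """Yield records with start_ms <= timestamp < end_ms.

    Reading stops at the first record whose timestamp is at or past
    end_ms + grace_ms, so late records that show up before that point
    are still picked up.
    """
    for ts, value in records:
        if ts >= end_ms + grace_ms:
            break  # far enough past the window: assume no more late records
        if start_ms <= ts < end_ms:
            yield ts, value


stream = [(100, "a"), (150, "b"), (210, "c"), (190, "late"),
          (400, "d"), (180, "too late")]
print(list(records_in_window(stream, start_ms=100, end_ms=200, grace_ms=50)))
# [(100, 'a'), (150, 'b'), (190, 'late')]
```

A larger `grace_ms` catches more stragglers at the cost of reading further past the end of the interval; the record at 180 in the sample is missed because it arrives after the cutoff has already been reached.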

The code to rewind to startTime is:

Date.parse(string)
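With the Python client, the rewind step might look roughly like the following. This is a sketch rather than the answerer's original code: `rewind_to` is a hypothetical helper, and `consumer` is assumed to behave like a kafka-python `KafkaConsumer` that already has the given partitions assigned.

```python
def rewind_to(consumer, partitions, start_ms):
    """Seek each partition to the first offset whose timestamp is >= start_ms.

    `consumer` is expected to offer offsets_for_times(), end_offsets() and
    seek() as kafka-python's KafkaConsumer does; `partitions` are the
    TopicPartition objects assigned to it.
    Returns the offset chosen for each partition.
    """
    found = consumer.offsets_for_times({tp: int(start_ms) for tp in partitions})
    positions = {}
    for tp in partitions:
        hit = found.get(tp)
        if hit is None:
            # No record at or after start_ms: position at the end of the log.
            positions[tp] = consumer.end_offsets([tp])[tp]
        else:
            positions[tp] = hit.offset
        consumer.seek(tp, positions[tp])
    return positions


# Hypothetical usage (needs a running broker, topic name is a placeholder):
#   consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
#   tp = TopicPartition("my-topic", 0)
#   consumer.assign([tp])
#   rewind_to(consumer, [tp], start_ms)
```

Falling back to the end offset when no record matches is a design choice of this sketch; raising an error instead would be equally reasonable.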

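Answer 1 (score: 0)

Use offsetsForTimes (offsets_for_times in kafka-python) to get the offset that corresponds to the desired timestamp. In Python it looks like this: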

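from datetime import datetime
from kafka import KafkaConsumer, TopicPartition

topic  = "www.kilskil.com"
broker = "localhost:9092"

# Let's count the messages from the first day of the new year.
# Note: .timestamp() interprets naive datetimes in the local time zone.
date_in  = datetime(2019, 1, 1)
date_out = datetime(2019, 1, 2)

# In the simple case, without any special Kafka configuration, a topic has
# a single partition and its number is 0.
tp = TopicPartition(topic, 0)

# Assign the partition explicitly so that seek() can be called right away;
# when subscribing by topic, a dummy poll() would be needed before seeking.
# consumer_timeout_ms ends the loop below if no further messages arrive.
consumer = KafkaConsumer(bootstrap_servers=broker, consumer_timeout_ms=10000)
consumer.assign([tp])

# The two methods that matter here are offsets_for_times() and seek().
# Kafka expects timestamps as integer milliseconds.
rec_in  = consumer.offsets_for_times({tp: int(date_in.timestamp() * 1000)})
rec_out = consumer.offsets_for_times({tp: int(date_out.timestamp() * 1000)})

# A partition maps to None when no message has a timestamp at or after the
# requested one, so fall back to the end of the log in that case.
end = consumer.end_offsets([tp])[tp]
offset_in  = rec_in[tp].offset if rec_in.get(tp) is not None else end
offset_out = rec_out[tp].offset if rec_out.get(tp) is not None else end

consumer.seek(tp, offset_in)  # go to the first message of the new year

c = 0
if offset_in < offset_out:
    for msg in consumer:
        if msg.offset >= offset_out:
            break
        c += 1  # msg.timestamp holds the message timestamp in milliseconds
        if msg.offset + 1 >= offset_out:
            break  # last message in the range, no need to wait for more

print("{c} messages between {_in} and {_out}".format(c=c, _in=str(date_in), _out=str(date_out)))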

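Keep in mind that Kafka measures timestamps in milliseconds and stores them as a long. Python's datetime returns the timestamp in seconds, as a float, so it has to be multiplied by 1000. The offsets_for_times method returns a dict with TopicPartition keys and OffsetAndTimestamp values; the value is None for a partition that has no message at or after the requested timestamp.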
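The seconds-to-milliseconds conversion can be wrapped in a small helper that also makes the time zone explicit. This is a minimal sketch: `to_kafka_ms` and `from_kafka_ms` are hypothetical names, and treating naive datetimes as UTC is an assumption made here (the code above uses local time instead).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_kafka_ms(dt: datetime) -> int:
    """Convert a datetime to integer epoch milliseconds.

    Naive datetimes are treated as UTC (an assumption of this sketch).
    Integer timedelta division avoids float rounding.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_kafka_ms(ms: int) -> datetime:
    """Convert epoch milliseconds back to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


print(to_kafka_ms(datetime(2019, 1, 1)))  # 1546300800000
print(from_kafka_ms(1546300800000))       # 2019-01-01 00:00:00+00:00
```

The same `from_kafka_ms` conversion applies to the `.timestamp` field of a consumed message, which is also expressed in milliseconds.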