Is there a way to fetch data by time interval rather than by offset? Say I want to consume all of yesterday's data: how should I do that?
Answer 0 (score: 0)
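You can look up the earliest offset at the start of the interval and rewind to it. Knowing where the interval ends is harder, because records with earlier timestamps may arrive later. So consume records from the start of the interval until you see a record whose timestamp is later than endTime, then read some more records to catch late messages.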
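The stopping rule described above can be sketched as follows. This is only an illustration: `Record`, `take_interval`, and the `grace` parameter are hypothetical names, and `records` is any iterable of objects with a millisecond `timestamp` attribute (Kafka client records expose one). How many extra records to read is a tuning choice that depends on how late your producers can be.

```python
from collections import namedtuple

# Hypothetical stand-in for a consumed record, used only for illustration.
Record = namedtuple("Record", ["offset", "timestamp", "value"])


def take_interval(records, start_ms, end_ms, grace=100):
    """Collect records with start_ms <= timestamp < end_ms.

    After the first record at or past end_ms is seen, `grace` more records
    are examined so that late (out-of-order) records can still be picked up.
    """
    selected = []
    remaining = None  # stays None until the first record at/after end_ms
    for rec in records:
        if remaining is None and rec.timestamp >= end_ms:
            remaining = grace  # start the late-message window
        elif remaining is not None:
            if remaining == 0:
                break  # late-message window exhausted
            remaining -= 1
        if start_ms <= rec.timestamp < end_ms:
            selected.append(rec)
    return selected
```

With a real consumer, `records` would be the consumer itself after seeking to the start offset.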
The code to rewind to startTime is:
Date.parse(string)
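For comparison, a rewind to startTime might look like this in Python with kafka-python (the library used in the other answer). This is a sketch, not the answerer's code: `rewind_to` is a hypothetical helper, and it assumes the consumer already has partition `tp` assigned, for example via `consumer.assign([tp])`.

```python
def rewind_to(consumer, tp, start_dt):
    """Seek `consumer` on partition `tp` to the first record at or after start_dt.

    Returns the offset that was seeked to, or None if no such record exists.
    """
    # Kafka timestamps are integer milliseconds since the epoch.
    start_ms = int(start_dt.timestamp() * 1000)
    found = consumer.offsets_for_times({tp: start_ms})
    entry = found.get(tp)
    if entry is None:
        # No record at or after start_dt in this partition: leave the position alone.
        return None
    consumer.seek(tp, entry.offset)
    return entry.offset
```

It would be called as `rewind_to(consumer, TopicPartition(topic, 0), datetime(2019, 1, 1))`, where the topic name and partition number are whatever your setup uses.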
Answer 1 (score: 0)
Use offsetsForTimes to get the offset that corresponds to the timestamp you want. In Python it looks like this:
from datetime import datetime
from kafka import KafkaConsumer, TopicPartition
topic = "www.kilskil.com"
broker = "localhost:9092"
# let's count the messages from the first day of the new year
date_in = datetime(2019,1,1)
date_out = datetime(2019,1,2)
consumer = KafkaConsumer(topic, bootstrap_servers=broker, enable_auto_commit=True)
consumer.poll()  # poll once (or read a message) before seeking, so that partitions are assigned
tp = TopicPartition(topic, 0)  # partition no. 0
# in the simplest case, with no special Kafka configuration, each topic has a single partition
# and its number is 0
# the two methods that matter here are offsets_for_times() and seek()
# Kafka expects integer milliseconds, so convert the float returned by timestamp()
rec_in = consumer.offsets_for_times({tp: int(date_in.timestamp() * 1000)})
rec_out = consumer.offsets_for_times({tp: int(date_out.timestamp() * 1000)})
consumer.seek(tp, rec_in[tp].offset)  # jump to the first message of the new year
c = 0
for msg in consumer:
    if msg.offset >= rec_out[tp].offset:
        break
    c += 1
    # message also has .timestamp field
print("{c} messages between {_in} and {_out}".format(c=c, _in=str(date_in), _out=str(date_out)))
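Keep in mind that Kafka measures timestamps in milliseconds and stores them as a long. Python's datetime library returns timestamps in seconds, so we need to multiply by 1000. The offsets_for_times method returns a dict with TopicPartition keys and OffsetAndTimestamp values.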
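The conversion and the shape of the returned dict can be sketched with a few small helpers. All three function names are hypothetical. The sketch assumes naive datetimes are meant as UTC, that "yesterday" means the previous UTC calendar day, and that `now` is timezone-aware. It also allows for a `None` value in the dict, which kafka-python is understood to return for a partition that has no message at or after the requested timestamp.

```python
from datetime import datetime, timedelta, timezone


def to_kafka_millis(dt):
    """Convert a datetime to integer epoch milliseconds, the unit Kafka uses."""
    if dt.tzinfo is None:
        # Assumption: treat naive datetimes as UTC. Without this, timestamp()
        # would silently use the machine's local time zone.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def yesterday_bounds(now):
    """Return (start_ms, end_ms) for the UTC calendar day before `now`."""
    today = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return to_kafka_millis(today - timedelta(days=1)), to_kafka_millis(today)


def offset_or_default(offsets, tp, default):
    """Read one offset from an offsets_for_times() style dict, tolerating None."""
    entry = offsets.get(tp)
    return default if entry is None else entry.offset
```

A reasonable `default` for the end of the interval is the partition's current end offset, so that the loop stops at the newest message instead of failing on a missing value.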