I'm developing an application that uses Kafka, with Scala as the tech stack. My Kafka consumer code is as follows:
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "test")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")

val consumer: KafkaConsumer[String, String] = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList(topic))
val record = consumer.poll(Duration.ofMillis(500)).asScala.toList
It gives me all the records, but the problem is that there is already data in the topic, which can lead to duplicates: records with the same key may already exist. Is there a way to retrieve only records from a specific time onward? That is, can I capture the current time before polling and retrieve only records produced after that time? Is this possible?
Answer 0 (score: 1)
The only way to consume from a given timestamp is to use offsetsForTimes, seek to the returned offset, and commitSync the new position. However, be aware that the data stream is continuous, so duplicate keys may show up again later.

If you only want to see the latest value for each key, you would be better off using a KTable.
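A minimal sketch of the KTable approach, assuming a Kafka Streams dependency on the classpath; the application id and topic name are illustrative, and the Serdes import path varies slightly between Kafka versions:

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object LatestPerKey extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-per-key")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  // A KTable models a changelog: each key maps to its latest value,
  // so older records with the same key are superseded rather than duplicated
  val table = builder.table[String, String]("myInTopic")
  table.toStream.foreach((k, v) => println(s"Key: $k, Value: $v"))

  new KafkaStreams(builder.build(), props).start()
}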
Answer 1 (score: 1)
You can use the offsetsForTimes method of the KafkaConsumer API.
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

object OffsetsForTime extends App {

  implicit def toJavaOffsetQuery(offsetQuery: Map[TopicPartition, scala.Long]): java.util.Map[TopicPartition, java.lang.Long] =
    offsetQuery
      .map { case (tp, time) => tp -> java.lang.Long.valueOf(time) }
      .asJava

  val topic = "myInTopic"
  val timestamp: Long = 1595971151000L

  val props = new Properties()
  props.put("group.id", "group-id1337")
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("auto.offset.reset", "earliest")

  val consumer: KafkaConsumer[String, String] = new KafkaConsumer[String, String](props)
  val topicPartition = new TopicPartition(topic, 0)
  consumer.assign(java.util.Collections.singletonList(topicPartition))

  // dummy poll before calling seek
  consumer.poll(Duration.ofMillis(500))

  // get the first offset whose timestamp is at or after the given timestamp
  val (_, offsetAndTimestamp) = consumer.offsetsForTimes(Map(topicPartition -> timestamp)).asScala.head

  // seek to that offset
  consumer.seek(topicPartition, offsetAndTimestamp.offset())

  // poll data from the new position
  val record = consumer.poll(Duration.ofMillis(500)).asScala.toList

  for (data <- record) {
    println(s"Timestamp: ${data.timestamp()}, Key: ${data.key()}, Value: ${data.value()}")
  }
}
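Answer 0 also mentions committing the result. As a hedged addition to the program above (not part of the original answer), the sought position can be persisted for the consumer group with commitSync, so that a restart resumes from this point rather than from the previously committed offsets:

// append after the seek, inside OffsetsForTime
import org.apache.kafka.clients.consumer.OffsetAndMetadata

consumer.commitSync(
  java.util.Collections.singletonMap(topicPartition, new OffsetAndMetadata(offsetAndTimestamp.offset()))
)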
./kafka/current/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myInTopic --from-beginning --property print.value=true --property print.timestamp=true
CreateTime:1595971142560 1_old
CreateTime:1595971147697 2_old
CreateTime:1595971150136 3_old
CreateTime:1595971192649 1_new
CreateTime:1595971194489 2_new
CreateTime:1595971196416 3_new
Choose a timestamp somewhere between 3_old and 1_new to consume only the "new" messages.
Timestamp: 1595971192649, Key: null, Value: 1_new
Timestamp: 1595971194489, Key: null, Value: 2_new
Timestamp: 1595971196416, Key: null, Value: 3_new