pyspark Kafka streaming offsets

Date: 2018-10-05 16:46:33

Tags: apache-spark pyspark apache-kafka streaming offset

I picked up the snippet below, related to streaming from a Kafka topic offset in pyspark, from the link referenced further down.


fromOffsets = fromOffset)

Reference link: Spark Streaming kafka offset manage

If I have to read the most recent 15 minutes of data from Kafka for each window/batch, I don't understand what value I am supposed to supply below.

fromOffset = {topicPartion: long(enter the numeric offset here)}
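
For context, fromOffsets in KafkaUtils.createDirectStream (the pyspark DStream Kafka API) is simply a dict mapping each TopicAndPartition to the offset at which reading should start. A minimal sketch, where the topic name, partition number, and broker address are placeholder assumptions and ssc is an existing StreamingContext:

from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

# Hypothetical example: start reading partition 0 of "my_topic" at offset 42.
start_offsets = {TopicAndPartition("my_topic", 0): 42}

kvs = KafkaUtils.createDirectStream(ssc, ["my_topic"],
                                    {"metadata.broker.list": "localhost:9092"},
                                    fromOffsets=start_offsets)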

1 Answer:

Answer 0: (score: 0)

Basically, this is what gives us checkpoint-like control over the stream. Managing offsets is most beneficial for achieving data continuity over the lifecycle of the streaming process. For example, when the streaming application is shut down or an unexpected failure occurs, the offset ranges are lost unless they are persisted in a non-volatile data store. Furthermore, without reading the partition offsets back, the Spark Streaming job cannot resume processing from where it last stopped. Offsets can therefore be handled in several ways; one approach is to store the offset values in Zookeeper and read them back when creating the DStream.

from kazoo.client import KazooClient
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    # Reuse a single Kazoo connection per process.
    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']

def save_offsets(rdd):
    # Persist the end offset of each processed range to Zookeeper.
    # A single znode per topic is used, matching the single-partition
    # assumption (var_partition = 0) below.
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{var_topic_src_name}"
        print(path)
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())

# Read the previously saved offset (if any) before creating the DStream.
zk = get_zookeeper_instance()
var_offset_path = f'/consumers/{var_topic_src_name}'

try:
    var_offset = int(zk.get(var_offset_path)[0])
except Exception:
    print("The Spark Streaming job is starting for the first time; the offset defaults to zero")
    var_offset = 0

var_partition = 0
topicpartion = TopicAndPartition(var_topic_src_name, var_partition)
fromoffset = {topicpartion: var_offset}
print(fromoffset)

# var_topic_src_name, var_kafka_parms_src, serializer and handler are assumed
# to be defined elsewhere in the application.
kvs = KafkaUtils.createDirectStream(ssc,
                                    [var_topic_src_name],
                                    var_kafka_parms_src,
                                    valueDecoder=serializer.decode_message,
                                    fromOffsets=fromoffset)
kvs.foreachRDD(handler)
kvs.foreachRDD(save_offsets)
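
Note that the code above always resumes from the last offset saved in Zookeeper rather than from a point in time, so it does not by itself cover the "last 15 minutes" part of the question. One possible way to turn a timestamp into a starting offset, outside the answer above and assuming the kafka-python package plus brokers new enough to support timestamp lookups (0.10.1+), is KafkaConsumer.offsets_for_times; the broker and topic names below are placeholders:

import time
from kafka import KafkaConsumer, TopicPartition
from pyspark.streaming.kafka import TopicAndPartition

# Hypothetical broker/topic names.
consumer = KafkaConsumer(bootstrap_servers="127.0.0.1:9092")
tp = TopicPartition("my_topic", 0)

# Resolve "15 minutes ago" (milliseconds since epoch) to an offset.
fifteen_minutes_ago_ms = int((time.time() - 15 * 60) * 1000)
lookup = consumer.offsets_for_times({tp: fifteen_minutes_ago_ms})

# offsets_for_times returns None for a partition with no message at or after
# the timestamp; fall back to 0 (or to the Zookeeper-saved offset) in that case.
start_offset = lookup[tp].offset if lookup[tp] is not None else 0
consumer.close()

fromoffset = {TopicAndPartition("my_topic", 0): start_offset}

The resulting dict can then be passed as fromOffsets to createDirectStream in place of the Zookeeper-derived value.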

Reference:

pySpark Kafka Direct Streaming update Zookeeper / Kafka Offset

Thanks,

Karthikeyan Rasipalayam Durairaj