Integrating Apache Kafka with Apache Spark Streaming in Python

Date: 2015-05-18 08:15:29

Tags: python apache-spark apache-kafka spark-streaming

I am trying to integrate Apache Kafka with Apache Spark Streaming using Python (I am new to all of these technologies).

To do this, I have completed the following steps:

  1. Started Zookeeper
  2. Started Apache Kafka
  3. Added a topic in Apache Kafka
  4. Listed the available topics with this command:

      bin/kafka-topics.sh --list --zookeeper localhost:2181
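
     For reference, the topic in step 3 can be created with the same script; this is the standard Kafka 0.8-era syntax, shown here with a placeholder topic name:

      bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic <topic name>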

  5. Took the Kafka word-count code from here:
     https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py

     The code is:

      from __future__ import print_function
      
      import sys
      
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kafka import KafkaUtils
      
      if __name__ == "__main__":
          if len(sys.argv) != 3:
              print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
              exit(-1)
      
          sc = SparkContext(appName="PythonStreamingKafkaWordCount")
          ssc = StreamingContext(sc, 1)
      
          zkQuorum, topic = sys.argv[1:]
          kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
          lines = kvs.map(lambda x: x[1])
          counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a+b)
          counts.pprint()
      
          ssc.start()
          ssc.awaitTermination()
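
     A note on the createStream call above: its last argument is a dict mapping each topic name to the number of partitions to consume, each consumed in its own thread (per the Spark docs). A minimal sketch consuming two topics with two threads each (topic names here are placeholders):

      kvs = KafkaUtils.createStream(
          ssc, "localhost:2181", "spark-streaming-consumer",
          {"topic-a": 2, "topic-b": 2})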
      
  6. Executed the code with this command:

      ./spark-submit /root/girish/python/kafkawordcount.py localhost:2181

     I received this error:

        Traceback (most recent call last):
          File "/root/girish/python/kafkawordcount.py", line 28, in <module>
            kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
          File "/root/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/python/pyspark/streaming/kafka.py", line 72, in createStream
            raise e
        py4j.protocol.Py4JJavaError: An error occurred while calling o23.loadClass.
        : java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
                at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
                at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
                at java.security.AccessController.doPrivileged(Native Method)
                at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
                at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:606)
                at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
                at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
                at py4j.Gateway.invoke(Gateway.java:259)
                at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
                at py4j.commands.CallCommand.execute(CallCommand.java:79)
                at py4j.GatewayConnection.run(GatewayConnection.java:207)
                at java.lang.Thread.run(Thread.java:745)
        
  7. Updated the execution command using the answer to this question:
     spark submit failed with spark streaming workdcount python code

           ./spark-submit --jars /root/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/lib/spark-streaming-kafka_2.10-1.3.1.jar,/usr/hdp/2.2.0.0-2041/kafka/libs/kafka_2.10-0.8.1.2.2.0.0-2041.jar,/usr/hdp/2.2.0.0-2041/kafka/libs/zkclient-0.3.jar,/usr/hdp/2.2.0.0-2041/kafka/libs/metrics-core-2.2.0.jar  /root/girish/python/kafkawordcount.py localhost:2181 <topic name>
          

     Now I receive this error:

          File "/root/girish/python/kafkawordcount.py", line 28, in <module>
              kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
            File "/root/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/python/pyspark/streaming/kafka.py", line 67, in createStream
              jstream = helper.createStream(ssc._jssc, kafkaParams, topics, jlevel)
            File "/root/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 529, in __call__
            File "/root/spark-1.2.0.2.2.0.0-82-bin-2.6.0.2.2.0.0-2041/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 265, in get_command_part
          AttributeError: 'dict' object has no attribute '_get_object_id'
          

Please help me resolve this issue.

Thanks in advance.

PS: I am using Apache Spark 1.2

2 answers:

Answer 0 (score: 1):

I faced the same problem and fixed it by adding the kafka-assembly package:

bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.1 ~/py/sparkjob.py

Use the package version that matches your Spark and Kafka versions.
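
As an illustration (the coordinate below is an example to adjust, not a prescription): for a Spark 1.6.3 build with Scala 2.10 and the Kafka 0.8 connector, the matching package would be:

bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.3 ~/py/sparkjob.py

Note that --packages was only added to spark-submit in Spark 1.3; on Spark 1.2 (the asker's version) the jars have to be passed explicitly with --jars, as in the question.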

Answer 1 (score: 0):

I solved the problem by using Apache Spark 1.3; its Python support is better than in version 1.2.
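
For background, my understanding of why the upgrade fixes the second error: in Spark 1.2, the Python KafkaUtils.createStream hands the plain Python dicts (kafkaParams and topics) straight to the Py4J call visible in the traceback, and Py4J cannot serialize a raw dict, hence the AttributeError: 'dict' object has no attribute '_get_object_id'. Spark 1.3 converts them to Java maps before the gateway call, roughly like this sketch (illustrative, not the exact Spark source):

from py4j.java_collections import MapConverter

def to_java_maps(ssc, kafkaParams, topics):
    """Convert the Python dicts into Java maps that Py4J can pass
    through the gateway, instead of handing raw dicts to the JVM."""
    client = ssc.sparkContext._gateway._gateway_client
    jparam = MapConverter().convert(kafkaParams, client)
    jtopics = MapConverter().convert(topics, client)
    return jparam, jtopics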