Spark Streaming app cannot receive messages from Kafka

Asked: 2017-08-28 14:23:42

Tags: python apache-spark pyspark apache-kafka

I publish messages to my Kafka topic with the following Python producer (I can also receive the published data perfectly well with a Python consumer in Jupyter).

from kafka import KafkaProducer
import json, time

userdata = {
    "ipaddress": "172.16.0.57",
    "logtype": "",
    "mid": "",
    "name": "TJ"
}

# Serialize each value as UTF-8 encoded JSON
producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

for i in range(10):
    print("adding", i)
    producer.send('test', userdata)
    time.sleep(3)
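
For reference, a minimal consumer sketch (assuming the same kafka-python package and the same 'test' topic) along the lines of what receives these messages fine in Jupyter:

from kafka import KafkaConsumer
import json

# Read the topic from the beginning and decode each JSON-encoded value
consumer = KafkaConsumer('test',
                         bootstrap_servers=['localhost:9092'],
                         auto_offset_reset='earliest',
                         value_deserializer=lambda v: json.loads(v.decode('utf-8')))

for message in consumer:
    print(message.value)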

But when I try to run the Kafka streaming example in Spark, I get nothing back (I should note that Spark itself is working on my workstation, since I can run the network streaming example without any problem):

from __future__ import print_function
from pyspark.streaming.kafka import KafkaUtils
import sys
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import json

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.10:2.0.2 pyspark-shell'

sc = SparkContext("local[2]", "KafkaSTREAMWordCount")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

# Receiver-based stream: ZooKeeper quorum, consumer group id, {topic: partitions to consume}
kafka_stream = KafkaUtils.createStream(ssc, "localhost:2181", "raw-event-streaming-consumer", {"test": 1})

# Each record is a (key, value) pair; parse the JSON value
# (tuple-unpacking lambdas are Python 2 only, which matches my setup)
parsed = kafka_stream.map(lambda (k, v): json.loads(v))
parsed.pprint()

ssc.start()
ssc.awaitTermination()

Here is a sample of the output:

-------------------------------------------
Time: 2017-08-28 14:08:32
-------------------------------------------

-------------------------------------------
Time: 2017-08-28 14:08:33
-------------------------------------------

-------------------------------------------
Time: 2017-08-28 14:08:34
-------------------------------------------

Note: my system specs are as follows:

Ubuntu 16.04
Spark: spark-2.2.0-bin-hadoop2.7
Jupyter Notebook (Python 2.7)
Kafka: kafka_2.11-0.11.0.0

I have the following lines in my .bashrc:

export PATH="/home/myubuntu/anaconda3/bin:$PATH"

export PATH="/home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/bin:$PATH"

export PATH="/home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/jars:$PATH"

export PATH="/home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/python:$PATH"

export PATH="/home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/python/pyspark:$PATH"

export PATH="/home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/python/pyspark/streaming:$PATH"


function snotebook ()
{
    # Spark path (based on your computer)
    SPARK_PATH=~/spark-2.0.0-bin-hadoop2.7

    export PYSPARK_DRIVER_PYTHON="jupyter"
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

    # For Python 3 users, you have to add the line below or you will get an error
    #export PYSPARK_PYTHON=python3

    #$SPARK_PATH/bin/pyspark --master local[2]
    /home/myubuntu/Desktop/spark-2.2.0-bin-hadoop2.7/bin/pyspark --master local[2]
}

1 Answer:

Answer 0 (score: 0)

I found the mistake. With spark-2.2.0-bin-hadoop2.7, the Kafka package has to match both the Scala build of Spark (2.11) and the Spark version (2.2.0), so the following needs to be used:

--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0
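
Applied to the code in the question, only the package coordinate in PYSPARK_SUBMIT_ARGS changes:

import os

# Scala 2.11 build of the Kafka 0.8 connector, versioned to match Spark 2.2.0
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'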