I want to consume Kafka messages from an arbitrary offset via KafkaUtils.createDirectStream.
My source code:
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

def functionToCreateContext():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    kvs = KafkaUtils.createDirectStream(
        ssc,
        ['test123'],
        {"metadata.broker.list": "localhost:9092"},
        {TopicAndPartition("test123", 0): 100, TopicAndPartition("test123", 1): 100}
    )
    # kvs = kvs.checkpoint(10)

    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()
    return ssc

if __name__ == "__main__":
    ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext())
    ssc.start()
    ssc.awaitTermination()
But I get the following error:
Traceback (most recent call last):
  File "/usr/local/spark-1.6.0-bin-hadoop2.6/examples/src/main/python/streaming/direct_kafka_wordcount.py", line 56, in <module>
    ssc = StreamingContext.getOrCreate("./checkpoint", functionToCreateContext())
  File "/usr/local/spark-1.6.0-bin-hadoop2.6/examples/src/main/python/streaming/direct_kafka_wordcount.py", line 45, in functionToCreateContext
    {TopicAndPartition("test123", 0): 100, TopicAndPartition("test123", 1): 100}
TypeError: unhashable type: 'TopicAndPartition'
The relevant pyspark source code:
@staticmethod
def createDirectStream(ssc, topics, kafkaParams, fromOffsets=None,
                       keyDecoder=utf8_decoder, valueDecoder=utf8_decoder,
                       messageHandler=None):

class TopicAndPartition(object):
    """
    Represents a specific topic and partition for Kafka.
    """

    def __init__(self, topic, partition):
        """
        Create a Python TopicAndPartition to map to the Java related object
        :param topic: Kafka topic name.
        :param partition: Kafka partition id.
        """
        self._topic = topic
        self._partition = partition

    def _jTopicAndPartition(self, helper):
        return helper.createTopicAndPartition(self._topic, self._partition)

.........

        jfromOffsets = dict([(k._jTopicAndPartition(helper), v)
                             for (k, v) in fromOffsets.items()])
So fromOffsets is supposed to be a dict whose keys are TopicAndPartition objects, which is what I am passing.
Any ideas?
Answer 0 (score: 2):
pyspark has a bug under Python 3: the TopicAndPartition class is missing a __hash__ method, so its instances cannot be used as dict keys. Switch from Python 3 to Python 2 and the error disappears.
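The root cause can be reproduced without pyspark at all. In the sketch below, TopicAndPartitionDemo and HashableTopicAndPartition are hypothetical stand-ins (not pyspark classes) that mimic the relevant part of pyspark 1.6's TopicAndPartition — an __eq__ with no __hash__ — and show one way a __hash__ could be supplied if you cannot downgrade to Python 2:

```python
# Hypothetical stand-in mimicking pyspark 1.6's TopicAndPartition:
# it defines __eq__ but no __hash__.
class TopicAndPartitionDemo(object):
    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        return (isinstance(other, self.__class__)
                and self._topic == other._topic
                and self._partition == other._partition)

# On Python 3, defining __eq__ without __hash__ sets __hash__ to None,
# so instances cannot be used as dict keys:
try:
    {TopicAndPartitionDemo("test123", 0): 100}
except TypeError as err:
    print(err)  # unhashable type: 'TopicAndPartitionDemo'

# A possible workaround: a subclass that restores a __hash__
# consistent with the inherited __eq__.
class HashableTopicAndPartition(TopicAndPartitionDemo):
    def __hash__(self):
        return hash((self._topic, self._partition))

from_offsets = {HashableTopicAndPartition("test123", 0): 100,
                HashableTopicAndPartition("test123", 1): 100}
print(len(from_offsets))  # 2
```

On Python 2 the first dict literal would succeed, because old-style hashing falls back to the object's identity; that is why the original error only appears under Python 3.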
The offsets should then be converted from int to long:
{TopicAndPartition("test123", 0): long(100), TopicAndPartition("test123", 1): long(100)}