I have the following code, which is supposed to connect to a local Kafka cluster and run a PySpark streaming job:
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"
##OTHER FUNCTIONS/CLASSES
def main(sc):
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # filename = sys.argv[1]
    # Execute Main functionality
    main(sc)
When I run this code, I get the following error:
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=PythonStreamingDirectKafkaWordCount, master=local[*]) created by __init__ at /home/ubuntu/spark-1.3.0-bin-hadoop2.4/hello1.py:30
What is the correct way to structure the code to avoid this error?
Answer 0 (score: 1)
Don't create the SparkContext twice. If it is created inside the main function, there is no reason to pass one in from outside:
def main():
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    # Build the one and only SparkContext from the conf defined above.
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 2)
    ...

if __name__ == "__main__":
    main()
Since stopping the StreamingContext also terminates the corresponding SparkContext (by default), there is no good reason to keep the two separate.
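For illustration, here is a minimal, self-contained sketch of that behaviour. It uses an in-memory queueStream instead of Kafka so it runs without a broker, and the app name "StopDemo" is just an example:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("StopDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

# A trivial in-memory stream so the context has an output operation to run.
ssc.queueStream([sc.parallelize(["a", "b", "a"])]) \
   .map(lambda w: (w, 1)) \
   .reduceByKey(lambda a, b: a + b) \
   .pprint()

ssc.start()
ssc.awaitTermination(timeout=5)    # let a couple of batches run
ssc.stop(stopSparkContext=True)    # stops the StreamingContext and the SparkContext
# ssc.stop(stopSparkContext=False) would instead leave sc usable for further batch work.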
SparkContext also provides a getOrCreate classmethod, which either creates a new context or returns the one that already exists.
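A minimal sketch of that pattern, assuming a Spark version whose Python API exposes SparkContext.getOrCreate (very old releases such as the 1.3.0 used above may not ship it):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("PythonStreamingDirectKafkaWordCount").setMaster("local[*]")

# Returns the running SparkContext if there is one, otherwise creates it from conf.
sc = SparkContext.getOrCreate(conf=conf)

# A second call is harmless: the same context instance is returned, so the
# "Cannot run multiple SparkContexts at once" error cannot occur.
assert SparkContext.getOrCreate(conf=conf) is sc

ssc = StreamingContext(sc, 2)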