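I want to use PySpark Streaming to count the words in the files under /predix/test/ and save the output to /predix/output/. The word counts should also be printed to the console, for example: {hello: 5}.
Below is my code, but the console never prints output such as {hello: 5}. Can anyone point out where my mistake is?
Thanks.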
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[2]")
    sc = SparkContext(appName="streamingWordsCount", conf=conf)
    ssc = StreamingContext(sc, 5)  # batch interval of 5 seconds

    # Monitor the directory and split each incoming line into words
    lines = ssc.textFileStream("/predix/test")
    words = lines.flatMap(lambda line: line.split(" "))
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    wordCounts.pprint()  # print the first elements of each batch to the console
    # The DStream method is saveAsTextFiles; the path is used as a prefix, one output directory per batch
    wordCounts.saveAsTextFiles("/predix/output")

    ssc.start()  # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate
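
One thing that may be worth checking, going by the Spark Streaming programming guide: textFileStream only processes files that show up in the monitored directory after the stream has started, and it expects them to be moved or renamed into that directory atomically. Files already sitting in /predix/test when the job starts would not be counted. Below is a minimal sketch of how a test file could be dropped in that way. The helper name drop_file and the staging_dir parameter are hypothetical, introduced only for illustration, and the sketch assumes the staging directory is on the same filesystem as the watched one.

```python
import os
import tempfile


def drop_file(text, watched_dir, staging_dir=None):
    """Write text to a staging file, then move it into watched_dir in one step.

    Hypothetical helper for illustration. staging_dir should be on the same
    filesystem as watched_dir so that os.replace is an atomic rename.
    """
    os.makedirs(watched_dir, exist_ok=True)
    # By default, stage in the parent of the watched directory (assumed same filesystem).
    if staging_dir is None:
        staging_dir = os.path.dirname(os.path.abspath(watched_dir))
    fd, staged_path = tempfile.mkstemp(suffix=".txt", dir=staging_dir)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    target = os.path.join(watched_dir, os.path.basename(staged_path))
    # The file appears in watched_dir only once it is fully written.
    os.replace(staged_path, target)
    return target
```

With the streaming job already running, calling drop_file("hello hello world\n", "/predix/test") from a second Python session should give the next batch something to count.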