PySpark Streaming word count example produces no console output

Time: 2018-03-06 23:24:53

Tags: pyspark spark-streaming word-count

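I want to use PySpark Streaming to count the words in the files inside /predix/test/ and save the output to /predix/output/. The console should also print the word counts, for example: {hello: 5}

Here is my code, but the console never prints output such as {hello: 5}. Can anyone point out where my mistake is?

Thanks.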

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[2]")
    sc = SparkContext(appName='streamingWordsCount', conf=conf)
    ssc = StreamingContext(sc, 5)   # batch interval: 5 seconds
    # textFileStream only picks up files that appear in the directory after the stream starts
    lines = ssc.textFileStream("/predix/test")
    words = lines.flatMap(lambda line: line.split(" "))
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    wordCounts.pprint()
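    # DStream has no saveASTextFile; saveAsTextFiles takes a prefix and writes
    # one directory per batch, named /predix/output-<batch time in ms>
    wordCounts.saveAsTextFiles("/predix/output")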

    ssc.start()             # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate
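
One likely reason for empty batches with textFileStream is that the files were already in /predix/test before the job started, or were written in place while the stream was reading. The sketch below shows one way to feed the stream: write the file in a staging directory, then rename it into the watched directory. The helper name drop_file_atomically, the "_staging" sibling directory, and the file name in the usage comment are all assumptions made for illustration, not part of the original code.

```python
import os
import tempfile


def drop_file_atomically(text, watched_dir, name, staging_dir=None):
    """Write `text` to a staging file, then rename it into `watched_dir`.

    Hypothetical helper: the stream should only ever see complete files.
    """
    os.makedirs(watched_dir, exist_ok=True)
    # Assumed default: a sibling "<watched_dir>_staging" folder, which is
    # expected to live on the same filesystem as the watched directory.
    staging_dir = staging_dir or os.path.normpath(watched_dir) + "_staging"
    os.makedirs(staging_dir, exist_ok=True)

    # Write the full content outside the watched directory first.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)

    # The rename is atomic when both paths share a filesystem.
    final_path = os.path.join(watched_dir, name)
    os.replace(tmp_path, final_path)
    return final_path


# Example call, to be made while the streaming job is already running:
# drop_file_atomically("hello hello hello hello hello\n", "/predix/test", "words-001.txt")
```

With the streaming job running, a call like the commented one should make the next batch print a count for hello, assuming the job can read /predix/test.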

0 answers:

No answers yet.