Spark textFileStream does not read files

Time: 2016-09-07 07:44:22

Tags: apache-spark pyspark

I am trying to get Spark Streaming to work, but it does not read any of the files I put into the directory.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext("local[*]", "StreamTest")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("checkpoint")

    files = ssc.textFileStream("file:///ApacheSpark/MLlib_testing/Streaming/data")

    words = files.flatMap(lambda line: line.split(" "))
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x,y: x+y)
    print "Oled siin ??"

    wordCounts.pprint()

    ssc.start()
    ssc.awaitTermination()

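Everything runs without errors, but no files are read from the folder. The print statement executes only once, when the application starts. What am I doing wrong?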

I am using Spark 1.6.2 on Windows 10. I could not get Spark 2.0.0 to run.

Edit 1: I added some console log output.

16/09/07 11:36:57 INFO JobScheduler: Added jobs for time 1473237417000 ms
16/09/07 11:36:57 INFO JobGenerator: Checkpointing graph for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updating checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO DStreamGraph: Updated checkpoint data for time 1473237417000 ms
16/09/07 11:36:57 INFO CheckpointWriter: Submitted checkpoint of time 1473237417000 ms writer queue
16/09/07 11:36:57 INFO CheckpointWriter: Saving checkpoint for time 1473237417000 ms to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000'
16/09/07 11:36:57 INFO CheckpointWriter: Deleting file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473233874000.bk
16/09/07 11:36:57 INFO CheckpointWriter: Checkpoint for time 1473237417000 ms saved to file 'file:/C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/checkpoint/checkpoint-1473237417000', took 6071 bytes and 72 ms
16/09/07 11:36:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
16/09/07 11:36:57 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 82 bytes
16/09/07 11:36:57 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:393) with 3 output partitions
16/09/07 11:36:57 INFO DAGScheduler: Final stage: ResultStage 3 (runJob at PythonRDD.scala:393)
16/09/07 11:36:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/09/07 11:36:57 INFO DAGScheduler: Missing parents: List()
16/09/07 11:36:57 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43), which has no missing parents
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.1 KB, free 15.8 KB)
16/09/07 11:36:57 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 19.3 KB)
16/09/07 11:36:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:59483 (size: 3.5 KB, free: 511.1 MB)
16/09/07 11:36:57 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/09/07 11:36:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43)
16/09/07 11:36:57 INFO TaskSchedulerImpl: Adding task set 3.0 with 3 tasks
16/09/07 11:36:57 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 1986 bytes)
16/09/07 11:36:57 INFO Executor: Running task 0.0 in stage 3.0 (TID 1)
16/09/07 11:36:57 INFO Executor: Running task 1.0 in stage 3.0 (TID 2)
16/09/07 11:36:57 INFO Executor: Running task 2.0 in stage 3.0 (TID 3)
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 9 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
16/09/07 11:36:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 23 ms
16/09/07 11:36:58 INFO FileInputDStream: Finding new files took 3 ms
16/09/07 11:36:58 INFO FileInputDStream: New files at time 1473237418000 ms:
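
The last log line lists no new files for that batch. One thing that may be worth ruling out, as an assumption rather than a confirmed diagnosis: the Spark Streaming programming guide asks that files be created in the monitored directory by atomically moving or renaming them into it, and files that were already there before the stream started are generally not picked up. A minimal Python sketch of such a drop follows; `drop_file_atomically` and its temporary staging directory are hypothetical names used for illustration, not part of Spark.

```python
import os
import tempfile


def drop_file_atomically(watched_dir, name, lines):
    """Write lines to a staging file, then rename it into watched_dir.

    The staging directory is created next to watched_dir so that the
    rename stays on the same filesystem.
    """
    watched_dir = os.path.abspath(watched_dir)
    if not os.path.isdir(watched_dir):
        os.makedirs(watched_dir)

    # Hypothetical staging location: a temporary directory beside the watched one.
    staging_dir = tempfile.mkdtemp(dir=os.path.dirname(watched_dir))
    staged_path = os.path.join(staging_dir, name)

    # Finish writing the file before it becomes visible to the stream.
    with open(staged_path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")

    # The file appears in watched_dir in one step, already complete.
    final_path = os.path.join(watched_dir, name)
    os.rename(staged_path, final_path)
    os.rmdir(staging_dir)
    return final_path
```

Calling, for example, `drop_file_atomically("data", "words.txt", ["hello spark hello"])` from a second console while the job is running should make the file show up in a later batch, provided the stream is watching that directory. The log above places the checkpoint under `C:/Users/Marko/Desktop/ApacheSpark/MLlib_testing/Streaming/`, so it may also be worth checking that the `file:///ApacheSpark/MLlib_testing/Streaming/data` URI passed to `textFileStream` resolves to the intended folder.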

0 answers:

No answers