Spark Streaming application developed with PySpark does not produce the expected output

Time: 2018-12-10 10:13:02

Tags: pyspark spark-streaming

I am new to Spark Streaming. I have developed a small Spark Streaming application that reads files from a directory and prints the output to the console (or to a text file).

Below is the code I wrote in Python:

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='PysparkStreaming')
ssc = StreamingContext(sc, 3)

lines = ssc.textFileStream('file:///home/cloudera/spark/logs/')
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
print(counts)

ssc.start()
ssc.awaitTermination()

I run it with:

spark-submit as_log_stream.py

As per the batch interval declared in the stream, the following output keeps appearing every 3 seconds, but the expected output, i.e. the word counts, is never displayed. Could you tell me what I am doing wrong?

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.13.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/10 02:01:36 INFO spark.SparkContext: Running Spark version 1.6.0
18/12/10 02:01:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/10 02:01:39 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.186.133 instead (on interface eth1)
18/12/10 02:01:39 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/12/10 02:01:40 INFO spark.SecurityManager: Changing view acls to: cloudera
18/12/10 02:01:40 INFO spark.SecurityManager: Changing modify acls to: cloudera
18/12/10 02:01:40 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
18/12/10 02:01:40 INFO util.Utils: Successfully started service 'sparkDriver' on port 50432.
18/12/10 02:01:41 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/12/10 02:01:41 INFO Remoting: Starting remoting
18/12/10 02:01:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.186.133:45299]
18/12/10 02:01:42 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem@192.168.186.133:45299]
18/12/10 02:01:42 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 45299.
18/12/10 02:01:42 INFO spark.SparkEnv: Registering MapOutputTracker
18/12/10 02:01:42 INFO spark.SparkEnv: Registering BlockManagerMaster
18/12/10 02:01:42 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-78e8d300-dbad-4008-a4ec-339f3599d8a1
18/12/10 02:01:42 INFO storage.MemoryStore: MemoryStore started with capacity 534.5 MB
18/12/10 02:01:43 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/12/10 02:01:44 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/12/10 02:01:44 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
    at org.spark-project.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
    at org.spark-project.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
    at org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
    at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$httpConnect$1(JettyUtils.scala:291)
    at org.apache.spark.ui.JettyUtils$$anonfun$7.apply(JettyUtils.scala:295)
    at org.apache.spark.ui.JettyUtils$$anonfun$7.apply(JettyUtils.scala:295)
    at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2040)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2032)
    at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:295)
    at org.apache.spark.ui.WebUI.bind(WebUI.scala:127)
    at org.apache.spark.SparkContext$$anonfun$14.apply(SparkContext.scala:489)
    at org.apache.spark.SparkContext$$anonfun$14.apply(SparkContext.scala:489)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:489)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)
18/12/10 02:01:44 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/12/10 02:01:44 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
18/12/10 02:01:44 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
18/12/10 02:01:44 INFO ui.SparkUI: Started SparkUI at http://192.168.186.133:4041
18/12/10 02:01:46 INFO util.Utils: Copying /home/cloudera/practice/spark/scripts/as_log_stream.py to /tmp/spark-a57a538e-e7c7-496b-ad88-8d968ac379e8/userFiles-b449f5bb-434a-47d7-a5bc-ca9eb3f9e001/as_log_stream.py
18/12/10 02:01:46 INFO spark.SparkContext: Added file file:/home/cloudera/practice/spark/scripts/as_log_stream.py at file:/home/cloudera/practice/spark/scripts/as_log_stream.py with timestamp 1544436106336
18/12/10 02:01:48 INFO executor.Executor: Starting executor ID driver on host localhost
18/12/10 02:01:48 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35274.
18/12/10 02:01:48 INFO netty.NettyBlockTransferService: Server created on 35274
18/12/10 02:01:48 INFO storage.BlockManagerMaster: Trying to register BlockManager
18/12/10 02:01:48 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:35274 with 534.5 MB RAM, BlockManagerId(driver, localhost, 35274)
18/12/10 02:01:48 INFO storage.BlockManagerMaster: Registered BlockManager
18/12/10 02:01:51 INFO dstream.FileInputDStream: Duration for remembering RDDs set to 60000 ms for org.apache.spark.streaming.dstream.FileInputDStream@326a8451
<pyspark.streaming.dstream.TransformedDStream object at 0x1111f90>
18/12/10 02:01:52 INFO dstream.ForEachDStream: metadataCleanupDelay = -1
18/12/10 02:01:52 INFO python.PythonTransformedDStream: metadataCleanupDelay = -1
18/12/10 02:01:52 INFO dstream.MappedDStream: metadataCleanupDelay = -1
18/12/10 02:01:52 INFO dstream.FileInputDStream: metadataCleanupDelay = -1
18/12/10 02:01:52 INFO dstream.FileInputDStream: Slide time = 3000 ms
18/12/10 02:01:52 INFO dstream.FileInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
18/12/10 02:01:52 INFO dstream.FileInputDStream: Checkpoint interval = null
18/12/10 02:01:52 INFO dstream.FileInputDStream: Remember duration = 60000 ms
18/12/10 02:01:52 INFO dstream.FileInputDStream: Initialized and validated org.apache.spark.streaming.dstream.FileInputDStream@326a8451
18/12/10 02:01:52 INFO dstream.MappedDStream: Slide time = 3000 ms
18/12/10 02:01:52 INFO dstream.MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
18/12/10 02:01:52 INFO dstream.MappedDStream: Checkpoint interval = null
18/12/10 02:01:52 INFO dstream.MappedDStream: Remember duration = 3000 ms
18/12/10 02:01:52 INFO dstream.MappedDStream: Initialized and validated org.apache.spark.streaming.dstream.MappedDStream@1b8496f
18/12/10 02:01:52 INFO python.PythonTransformedDStream: Slide time = 3000 ms
18/12/10 02:01:52 INFO python.PythonTransformedDStream: Storage level = StorageLevel(false, false, false, false, 1)
18/12/10 02:01:52 INFO python.PythonTransformedDStream: Checkpoint interval = null
18/12/10 02:01:52 INFO python.PythonTransformedDStream: Remember duration = 3000 ms
18/12/10 02:01:52 INFO python.PythonTransformedDStream: Initialized and validated org.apache.spark.streaming.api.python.PythonTransformedDStream@69dd174a
18/12/10 02:01:52 INFO dstream.ForEachDStream: Slide time = 3000 ms
18/12/10 02:01:52 INFO dstream.ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
18/12/10 02:01:52 INFO dstream.ForEachDStream: Checkpoint interval = null
18/12/10 02:01:52 INFO dstream.ForEachDStream: Remember duration = 3000 ms
18/12/10 02:01:52 INFO dstream.ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@32243192
18/12/10 02:01:52 INFO util.RecurringTimer: Started timer for JobGenerator at time 1544436114000
18/12/10 02:01:52 INFO scheduler.JobGenerator: Started JobGenerator at 1544436114000 ms
18/12/10 02:01:52 INFO scheduler.JobScheduler: Started JobScheduler
18/12/10 02:01:52 INFO streaming.StreamingContext: StreamingContext started
18/12/10 02:01:54 INFO dstream.FileInputDStream: Finding new files took 74 ms
18/12/10 02:01:54 INFO dstream.FileInputDStream: New files at time 1544436114000 ms:

18/12/10 02:01:55 INFO scheduler.JobScheduler: Added jobs for time 1544436114000 ms
18/12/10 02:01:55 INFO scheduler.JobScheduler: Starting job streaming job 1544436114000 ms.0 from job set of time 1544436114000 ms
18/12/10 02:01:55 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Registering RDD 3 (call at /usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py:1724)
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Got job 0 (runJob at PythonRDD.scala:393) with 1 output partitions
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (runJob at PythonRDD.scala:393)
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Missing parents: List()
18/12/10 02:01:55 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[7] at RDD at PythonRDD.scala:43), which has no missing parents
18/12/10 02:01:57 INFO dstream.FileInputDStream: Finding new files took 51 ms
18/12/10 02:01:57 INFO dstream.FileInputDStream: New files at time 1544436117000 ms:

18/12/10 02:01:57 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 5.9 KB, free 534.5 MB)
18/12/10 02:01:57 INFO scheduler.JobScheduler: Added jobs for time 1544436117000 ms
18/12/10 02:01:57 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.5 KB, free 534.5 MB)
18/12/10 02:01:57 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:35274 (size: 3.5 KB, free: 534.5 MB)
18/12/10 02:01:57 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1004
18/12/10 02:01:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (PythonRDD[7] at RDD at PythonRDD.scala:43) (first 15 tasks are for partitions Vector(0))
18/12/10 02:01:57 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
18/12/10 02:01:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 1963 bytes)
18/12/10 02:01:57 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 0)
18/12/10 02:01:57 INFO executor.Executor: Fetching file:/home/cloudera/practice/spark/scripts/as_log_stream.py with timestamp 1544436106336
18/12/10 02:01:57 INFO util.Utils: /home/cloudera/practice/spark/scripts/as_log_stream.py has been previously copied to /tmp/spark-a57a538e-e7c7-496b-ad88-8d968ac379e8/userFiles-b449f5bb-434a-47d7-a5bc-ca9eb3f9e001/as_log_stream.py
18/12/10 02:01:57 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/12/10 02:01:57 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 34 ms
18/12/10 02:01:59 INFO python.PythonRunner: Times: total = 1902, boot = 1608, init = 119, finish = 175
18/12/10 02:02:00 INFO python.PythonRunner: Times: total = 5, boot = -70, init = 75, finish = 0
18/12/10 02:02:00 INFO dstream.FileInputDStream: Finding new files took 36 ms
18/12/10 02:02:00 INFO dstream.FileInputDStream: New files at time 1544436120000 ms:

18/12/10 02:02:00 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 0). 1213 bytes result sent to driver
18/12/10 02:02:00 INFO scheduler.JobScheduler: Added jobs for time 1544436120000 ms
18/12/10 02:02:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 2725 ms on localhost (executor driver) (1/1)
18/12/10 02:02:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/12/10 02:02:00 INFO scheduler.DAGScheduler: ResultStage 1 (runJob at PythonRDD.scala:393) finished in 2.854 s
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 5.008004 s
-------------------------------------------
Time: 2018-12-10 02:01:54
-------------------------------------------

18/12/10 02:02:00 INFO scheduler.JobScheduler: Finished job streaming job 1544436114000 ms.0 from job set of time 1544436114000 ms
18/12/10 02:02:00 INFO scheduler.JobScheduler: Total delay: 6.455 s for time 1544436114000 ms (execution: 5.270 s)
18/12/10 02:02:00 INFO scheduler.JobScheduler: Starting job streaming job 1544436117000 ms.0 from job set of time 1544436117000 ms
18/12/10 02:02:00 INFO dstream.FileInputDStream: Cleared 0 old files that were older than 1544436054000 ms: 
18/12/10 02:02:00 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Registering RDD 11 (call at /usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py:1724)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Got job 1 (runJob at PythonRDD.scala:393) with 1 output partitions
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (runJob at PythonRDD.scala:393)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Missing parents: List()
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43), which has no missing parents
18/12/10 02:02:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
18/12/10 02:02:00 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.9 KB, free 534.5 MB)
18/12/10 02:02:00 INFO scheduler.InputInfoTracker: remove old batch metadata: 
18/12/10 02:02:00 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.5 KB, free 534.5 MB)
18/12/10 02:02:00 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:35274 (size: 3.5 KB, free: 534.5 MB)
18/12/10 02:02:00 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1004
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (PythonRDD[22] at RDD at PythonRDD.scala:43) (first 15 tasks are for partitions Vector(0))
18/12/10 02:02:00 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
18/12/10 02:02:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 1963 bytes)
18/12/10 02:02:00 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 1)
18/12/10 02:02:00 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/12/10 02:02:00 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/12/10 02:02:00 INFO python.PythonRunner: Times: total = 42, boot = -574, init = 616, finish = 0
18/12/10 02:02:00 INFO python.PythonRunner: Times: total = 44, boot = 6, init = 38, finish = 0
18/12/10 02:02:00 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 1). 1213 bytes result sent to driver
18/12/10 02:02:00 INFO scheduler.DAGScheduler: ResultStage 3 (runJob at PythonRDD.scala:393) finished in 0.115 s
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Job 1 finished: runJob at PythonRDD.scala:393, took 0.190777 s
18/12/10 02:02:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 1) in 128 ms on localhost (executor driver) (1/1)
18/12/10 02:02:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool 
-------------------------------------------
Time: 2018-12-10 02:01:57
-------------------------------------------

18/12/10 02:02:00 INFO scheduler.JobScheduler: Finished job streaming job 1544436117000 ms.0 from job set of time 1544436117000 ms
18/12/10 02:02:00 INFO scheduler.JobScheduler: Total delay: 3.815 s for time 1544436117000 ms (execution: 0.335 s)
18/12/10 02:02:00 INFO scheduler.JobScheduler: Starting job streaming job 1544436120000 ms.0 from job set of time 1544436120000 ms
18/12/10 02:02:00 INFO python.PythonRDD: Removing RDD 6 from persistence list
18/12/10 02:02:00 INFO spark.SparkContext: Starting job: runJob at PythonRDD.scala:393
18/12/10 02:02:00 INFO rdd.MapPartitionsRDD: Removing RDD 1 from persistence list
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Registering RDD 18 (call at /usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py:1724)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Got job 2 (runJob at PythonRDD.scala:393) with 1 output partitions
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (runJob at PythonRDD.scala:393)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Missing parents: List()
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (PythonRDD[23] at RDD at PythonRDD.scala:43), which has no missing parents
18/12/10 02:02:00 INFO storage.BlockManager: Removing RDD 6
18/12/10 02:02:00 INFO dstream.FileInputDStream: Cleared 0 old files that were older than 1544436057000 ms: 
18/12/10 02:02:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
18/12/10 02:02:00 INFO scheduler.InputInfoTracker: remove old batch metadata: 
18/12/10 02:02:00 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.9 KB, free 534.5 MB)
18/12/10 02:02:00 INFO storage.BlockManager: Removing RDD 1
18/12/10 02:02:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.5 KB, free 534.5 MB)
18/12/10 02:02:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:35274 (size: 3.5 KB, free: 534.5 MB)
18/12/10 02:02:00 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1004
18/12/10 02:02:00 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (PythonRDD[23] at RDD at PythonRDD.scala:43) (first 15 tasks are for partitions Vector(0))
18/12/10 02:02:00 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
18/12/10 02:02:00 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 1963 bytes)
18/12/10 02:02:00 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 2)
18/12/10 02:02:00 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/12/10 02:02:00 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/12/10 02:02:00 INFO python.PythonRunner: Times: total = 41, boot = -103, init = 144, finish = 0
18/12/10 02:02:01 INFO python.PythonRunner: Times: total = 56, boot = 22, init = 34, finish = 0
18/12/10 02:02:01 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 2). 1213 bytes result sent to driver
18/12/10 02:02:01 INFO scheduler.DAGScheduler: ResultStage 5 (runJob at PythonRDD.scala:393) finished in 0.128 s
18/12/10 02:02:01 INFO scheduler.DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:393, took 0.174702 s
18/12/10 02:02:01 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 2) in 129 ms on localhost (executor driver) (1/1)
18/12/10 02:02:01 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool 
-------------------------------------------
Time: 2018-12-10 02:02:00
-------------------------------------------

18/12/10 02:02:01 INFO scheduler.JobScheduler: Finished job streaming job 1544436120000 ms.0 from job set of time 1544436120000 ms
18/12/10 02:02:01 INFO scheduler.JobScheduler: Total delay: 1.045 s for time 1544436120000 ms (execution: 0.230 s)
18/12/10 02:02:01 INFO python.PythonRDD: Removing RDD 14 from persistence list
18/12/10 02:02:01 INFO storage.BlockManager: Removing RDD 14
18/12/10 02:02:01 INFO rdd.MapPartitionsRDD: Removing RDD 9 from persistence list
18/12/10 02:02:01 INFO storage.BlockManager: Removing RDD 9
18/12/10 02:02:01 INFO dstream.FileInputDStream: Cleared 0 old files that were older than 1544436060000 ms: 
18/12/10 02:02:01 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
18/12/10 02:02:01 INFO scheduler.InputInfoTracker: remove old batch metadata: 

Script used to generate the input files (this part works fine):

from random import randint
import time

def main():
    createFile()

def createFile():
    print('creating files')
    with open('//home//cloudera//practice//spark//source//server_log_name_12_0008.log', 'r') as logfile:
        loglines = logfile.readlines()
        linecount = 0
        while linecount <= 70:
            totalline = len(loglines)
            linenumber = randint(0, totalline - 10)
            with open('//home//cloudera//spark//logs//log{0}.txt'.format(linecount), 'w') as writefile:
                writefile.write(' '.join(line for line in loglines[linenumber:totalline]))
                print('creating file log{0}.txt'.format(linecount))
                linecount += 1
                time.sleep(2)

if __name__ == '__main__':
    main()

Observation: I can see the files being added in the Spark Streaming log, and the results being sent to the driver. Which driver is the output sent to, and how can I collect the output into a file? Please advise.

18/12/12 17:22:10 INFO dstream.FileInputDStream: Finding new files took 4611 ms
18/12/12 17:22:10 WARN dstream.FileInputDStream: Time taken to find new files exceeds the batch size. Consider increasing the batch size or reducing the number of files in the monitored directory.
18/12/12 17:22:10 INFO dstream.FileInputDStream: New files at time 1544664126000 ms:

18/12/12 17:22:11 INFO scheduler.JobScheduler: Added jobs for time 1544664126000 ms
18/12/12 17:22:11 INFO dstream.FileInputDStream: Finding new files took 53 ms
18/12/12 17:22:11 INFO dstream.FileInputDStream: New files at time 1544664129000 ms:
file:/home/cloudera/practice/spark/logs/log0.txt
file:/home/cloudera/practice/spark/logs/log1.txt
18/12/12 17:22:11 INFO python.PythonRunner: Times: total = 9183, boot = 8301, init = 262, finish = 620
18/12/12 17:22:11 INFO python.PythonRunner: Times: total = 39, boot = 38, init = 1, finish = 0
18/12/12 17:22:11 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 0). 1213 bytes result sent to driver
18/12/12 17:22:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 9720 ms on localhost (executor driver) (1/1)
18/12/12 17:22:12 INFO scheduler.DAGScheduler: ResultStage 1 (runJob at PythonRDD.scala:393) finished in 9.797 s
18/12/12 17:22:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/12/12 17:22:12 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 11.534252 s

1 Answer:

Answer 0 (score: 0):

My first impression is that you are not adding any new files to file:///home/cloudera/spark/logs/ while your Spark Streaming program is running.

textFileStream only picks up data that arrives after the job has started. Once the job is running, try copying some files into the directory.
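
Spark's file streams also expect complete files to appear in the monitored directory, typically by writing them elsewhere first and then moving (renaming) them in. A minimal sketch of that pattern, where the staging directory is a placeholder of mine, not something from your setup:

import os
import shutil
import time

STAGING_DIR = '/home/cloudera/spark/staging'    # hypothetical staging directory
MONITORED_DIR = '/home/cloudera/spark/logs'     # directory watched by textFileStream

def publish(lines, name):
    # Write the file outside the monitored directory first
    staged = os.path.join(STAGING_DIR, name)
    with open(staged, 'w') as f:
        f.write('\n'.join(lines))
    # Moving within the same filesystem is effectively a rename, so the
    # streaming job only ever sees complete files
    shutil.move(staged, os.path.join(MONITORED_DIR, name))

# While the streaming job is running, publish a new file every 2 seconds
for i in range(5):
    publish(['sample log line', 'another log line'], 'extra_log{0}.txt'.format(i))
    time.sleep(2)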

Also, are you sure you are not using HDFS? I see Cloudera and Spark, which usually implies Hadoop. If so, you need to make sure the path is hdfs://home/cloudera/spark/logs, or, if you have not configured a Hadoop NameNode, it should be hdfs://host:port/home/cloudera/spark/logs/
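
On the follow-up question of collecting the output in files rather than on the console: pprint only writes to the driver's stdout, whereas DStream.saveAsTextFiles writes every batch out as a directory of part files under a prefix you choose. A minimal sketch, where the output prefix is an assumption of mine:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='PysparkStreaming')
ssc = StreamingContext(sc, 3)

lines = ssc.textFileStream('file:///home/cloudera/spark/logs/')
counts = lines.flatMap(lambda line: line.split(' ')) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Still print each batch to the driver console for debugging
counts.pprint()

# Additionally persist each batch; this creates one directory per batch,
# e.g. .../wordcounts-1544436114000, containing part-* files
counts.saveAsTextFiles('file:///home/cloudera/spark/output/wordcounts')

ssc.start()
ssc.awaitTermination()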