I am running a Kafka-Spark streaming integration to ingest data in real time. Code:
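from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# set auto.offset.reset = smallest
sc = SparkContext(appName="PythonStreamingDirectKafka")
# The second argument is the batch interval in seconds (3600 s = one batch per hour).
ssc = StreamingContext(sc, 3600)
brokers = "*****"  # broker list, redacted in the original post
topic = "******"   # topic name, redacted in the original post
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
# Each record is a (key, value) pair; keep only the message value.
lines = kvs.map(lambda x: x[1])
lines.pprint()
lines.saveAsTextFiles('/tmp/')
ssc.start()
ssc.awaitTermination()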
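I use this command to run the job in the background:
/usr/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 --master yarn get_stream.py > stream.log 2>&1 &
Below is the stream.log generated by the Spark streaming job. The job shuts down on its own after 3-4 hours. With TRACE-level logging, this is what I get (not the whole log, it is too large):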
16/08/11 09:56:09 INFO StreamingContext: Invoking stop(stopGracefully=false) from shutdown hook
16/08/11 09:56:09 DEBUG JobScheduler: Stopping JobScheduler
16/08/11 09:56:09 INFO JobGenerator: Stopping JobGenerator immediately
16/08/11 09:56:09 INFO RecurringTimer: Stopped timer for JobGenerator after time 1470906000000
16/08/11 09:56:09 INFO JobGenerator: Stopped JobGenerator
16/08/11 09:56:09 DEBUG JobScheduler: Stopping job executor
16/08/11 09:56:09 DEBUG JobScheduler: Stopped job executor
16/08/11 09:56:09 INFO JobScheduler: Stopped JobScheduler
16/08/11 09:56:09 INFO StreamingContext: StreamingContext stopped successfully
16/08/11 09:56:09 INFO SparkContext: Invoking stop() from shutdown hook
16/08/11 09:56:09 DEBUG DFSClient: DFSClient writeChunk allocating new packet seqno=29, src=/var/log/spark/apps/application_1470897979038_0002.inprogress, packetSize=65016, chunksPerPacket=126, bytesCurBlock=64512
16/08/11 09:56:09 DEBUG DFSClient: DFSClient flush(): bytesCurBlock=64944 lastFlushOffset=64878 createNewBlock=false
16/08/11 09:56:09 DEBUG DFSClient: Queued packet 29
16/08/11 09:56:09 DEBUG DFSClient: Waiting for ack for: 29
16/08/11 09:56:09 TRACE Tracer: setting current span null
16/08/11 09:56:09 DEBUG DFSClient: DataStreamer block BP-730701491-10.102.224.120-1470897963878:blk_1073741871_1047 sending packet packet seqno: 29 offsetInBlock: 64512 lastPacketInBlock: false lastByteOffsetInBlock: 64944
16/08/11 09:56:09 DEBUG DFSClient: DFSClient seqno: 29 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
16/08/11 09:56:09 INFO SparkUI: Stopped Spark web UI at http://10.102.224.120:4040
16/08/11 09:56:09 DEBUG DFSClient: DFSClient writeChunk allocating new packet seqno=30, src=/var/log/spark/apps/application_1470897979038_0002.inprogress, packetSize=65016, chunksPerPacket=126, bytesCurBlock=64512
16/08/11 09:56:09 DEBUG DFSClient: Queued packet 30
16/08/11 09:56:09 DEBUG DFSClient: Queued packet 31
16/08/11 09:56:09 DEBUG DFSClient: Waiting for ack for: 31
16/08/11 09:56:09 TRACE Tracer: setting current span null
16/08/11 09:56:09 DEBUG DFSClient: DataStreamer block BP-730701491-10.102.224.120-1470897963878:blk_1073741871_1047 sending packet packet seqno: 30 offsetInBlock: 64512 lastPacketInBlock: false lastByteOffsetInBlock: 64944
16/08/11 09:56:09 DEBUG DFSClient: DFSClient seqno: 30 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
16/08/11 09:56:09 TRACE Tracer: setting current span null
16/08/11 09:56:09 DEBUG DFSClient: DataStreamer block BP-730701491-10.102.224.120-1470897963878:blk_1073741871_1047 sending packet packet seqno: 31 offsetInBlock: 64944 lastPacketInBlock: true lastByteOffsetInBlock: 64944
16/08/11 09:56:09 DEBUG DFSClient: DFSClient seqno: 31 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
16/08/11 09:56:09 DEBUG DFSClient: Closing old block BP-730701491-10.102.224.120-1470897963878:blk_1073741871_1047
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Call -> ip-10-102-224-120.ec2.internal/10.102.224.120:8020: complete {src: "/var/log/spark/apps/application_1470897979038_0002.inprogress" clientName: "DFSClient_NONMAPREDUCE_258672080_15" last { poolId: "BP-730701491-10.102.224.120-1470897963878" blockId: 1073741871 generationStamp: 1047 numBytes: 64944 } fileId: 16590}
16/08/11 09:56:09 DEBUG Client: The ping interval is 60000 ms.
16/08/11 09:56:09 DEBUG Client: Connecting to ip-10-102-224-120.ec2.internal/10.102.224.120:8020
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop: starting, having connections 2
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop sending #10767
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop got value #10767
16/08/11 09:56:09 DEBUG ProtobufRpcEngine: Call: complete took 3ms
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Response <- ip-10-102-224-120.ec2.internal/10.102.224.120:8020: complete {result: true}
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Call -> ip-10-102-224-120.ec2.internal/10.102.224.120:8020: getFileInfo {src: "/var/log/spark/apps/application_1470897979038_0002"}
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop sending #10768
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop got value #10768
16/08/11 09:56:09 DEBUG ProtobufRpcEngine: Call: getFileInfo took 1ms
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Response <- ip-10-102-224-120.ec2.internal/10.102.224.120:8020: getFileInfo {}
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Call -> ip-10-102-224-120.ec2.internal/10.102.224.120:8020: rename {src: "/var/log/spark/apps/application_1470897979038_0002.inprogress" dst: "/var/log/spark/apps/application_1470897979038_0002"}
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop sending #10769
16/08/11 09:56:09 DEBUG Client: IPC Client (461299828) connection to ip-10-102-224-120.ec2.internal/10.102.224.120:8020 from hadoop got value #10769
16/08/11 09:56:09 DEBUG ProtobufRpcEngine: Call: rename took 2ms
16/08/11 09:56:09 TRACE ProtobufRpcEngine: 46: Response <- ip-10-102-224-120.ec2.internal/10.102.224.120:8020: rename {result: true}
16/08/11 09:56:09 INFO YarnClientSchedulerBackend: Shutting down all executors
16/08/11 09:56:09 INFO YarnClientSchedulerBackend: Interrupting monitor thread
16/08/11 09:56:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/08/11 09:56:09 DEBUG AbstractService: Service: org.apache.hadoop.yarn.client.api.impl.YarnClientImpl entered state STOPPED
16/08/11 09:56:09 DEBUG Client: stopping client from cache: org.apache.hadoop.ipc.Client@7aa30390
16/08/11 09:56:09 INFO YarnClientSchedulerBackend: Stopped
16/08/11 09:56:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/08/11 09:56:09 INFO MemoryStore: MemoryStore cleared
16/08/11 09:56:09 INFO BlockManager: BlockManager stopped
16/08/11 09:56:09 INFO BlockManagerMaster: BlockManagerMaster stopped
16/08/11 09:56:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/08/11 09:56:09 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/08/11 09:56:09 INFO SparkContext: Successfully stopped SparkContext
16/08/11 09:56:09 INFO ShutdownHookManager: Shutdown hook called
16/08/11 09:56:09 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fb5326d4-f089-4aff-b394-bc126f12a983/pyspark-f6c4c7f7-f6e5-4dcf-a9cd-cf03391413d9
16/08/11 09:56:09 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/08/11 09:56:09 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fb5326d4-f089-4aff-b394-bc126f12a983
16/08/11 09:56:09 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fb5326d4-f089-4aff-b394-bc126f12a983/httpd-9104e927-bd6d-4338-bd20-fd01b7d3ce7f
16/08/11 09:56:09 DEBUG Client: stopping client from cache: org.apache.hadoop.ipc.Client@7aa30390
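Answer 0 (score: 0)
You are probably running the program in yarn-client mode, i.e. the driver lives on the host you submit from.
If you look at your log file, you will notice that the client was shut down: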
Invoking stop() from shutdown hook
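This was most likely triggered by the enclosing shell when your session terminated. Sending the job to the background does not prevent this, because the process is still tied to its parent (i.e. the session).
Apart from that, you are also using the console output to collect results, which you should not do, especially since you already write the same records to HDFS: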
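To illustrate the mechanism, here is a minimal Python sketch, assuming a POSIX system and assuming the signal involved is the hangup signal (SIGHUP) that is delivered when a session ends. The helper name `survives_hangup` is made up for this example. It starts a sleeping child process, sends it SIGHUP, and reports whether the child is still alive, once with the default signal handling and once with SIGHUP ignored.

```python
import signal
import subprocess
import sys


def survives_hangup(ignore_sighup):
    """Start a sleeping child, send it SIGHUP, and report whether it is still alive.

    Hypothetical helper for illustration only; POSIX systems only.
    """
    disposition = signal.SIG_IGN if ignore_sighup else signal.SIG_DFL

    def set_disposition():
        # Runs in the child between fork and exec. An ignored signal
        # stays ignored across exec, so the new program inherits it.
        signal.signal(signal.SIGHUP, disposition)

    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        preexec_fn=set_disposition,
    )
    child.send_signal(signal.SIGHUP)
    try:
        # With the default disposition SIGHUP terminates the child at once.
        child.wait(timeout=1.0)
        return False
    except subprocess.TimeoutExpired:
        # The child ignored the signal; clean it up ourselves.
        child.kill()
        child.wait()
        return True


if __name__ == "__main__":
    print("default handling survives SIGHUP:", survives_hangup(False))
    print("ignoring SIGHUP survives SIGHUP:", survives_hangup(True))
```

A plain background job behaves like the first case: running in the background changes nothing about how the process reacts to the signal.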
lines.saveAsTextFiles('/tmp/')
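I suggest the following to fix this:
a) Run in cluster mode: add --deploy-mode cluster to your arguments.
b) If you still want to collect the output, put nohup in front of spark-submit. nohup makes the process ignore the hangup signal (SIGHUP) it would otherwise receive when the parent session ends.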
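To make the two options concrete, here is a small Python sketch that assembles the corresponding spark-submit command lines from the command in the question. The helper `build_submit_command` and the two constants are made up for this example; the flags themselves (`--packages`, `--master yarn`, `--deploy-mode cluster`) are regular spark-submit options.

```python
import shlex

# Values taken from the command in the question.
SPARK_SUBMIT = "/usr/bin/spark-submit"
KAFKA_PACKAGE = "org.apache.spark:spark-streaming-kafka_2.10:1.6.2"


def build_submit_command(script, cluster_mode=False, use_nohup=False):
    """Return the spark-submit invocation as an argument list (hypothetical helper)."""
    command = [SPARK_SUBMIT, "--packages", KAFKA_PACKAGE, "--master", "yarn"]
    if cluster_mode:
        # Option a): the driver runs inside the YARN cluster, not on the submitting host.
        command += ["--deploy-mode", "cluster"]
    command.append(script)
    if use_nohup:
        # Option b): nohup runs the command with SIGHUP ignored.
        command.insert(0, "nohup")
    return command


if __name__ == "__main__":
    # Print both variants as shell-ready command lines.
    print(shlex.join(build_submit_command("get_stream.py", cluster_mode=True)))
    print(shlex.join(build_submit_command("get_stream.py", use_nohup=True)))
```

When typing the nohup variant into a shell, the redirection and the trailing ampersand from the question (> stream.log 2>&1 &) still go at the end. In cluster mode the driver output should end up in the YARN container logs rather than in a local stream.log, which is one more reason to rely on the files written by saveAsTextFiles.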