I want to connect a Flume avro agent to Spark Streaming 1.6.0 (HDP 2.4.0.0). The Flume agent configuration file works correctly:
# Please paste flume.conf here. Example:
# Sources, channels, and sinks are defined per
# agent name, in this case 'clickstream'.
##################################
### Declare the sources, channels and sinks
###################################
clickstream.sources = source1
clickstream.channels = channel1
clickstream.sinks = sink1
##################################
#SOURCES: properties of the sources
###################################
clickstream.sources.source1.type = spooldir
#The directory from which to read files from.
clickstream.sources.source1.spoolDir = /tmp/flume
#clickstream.sources.source1.command = tail -F /rsiiri/syspri/bdpiba/bdp00340/
clickstream.sources.source1.batchSize = 100
clickstream.sources.source1.channels = channel1
#When to delete completed files: never or immediate
clickstream.sources.source1.deletePolicy = never
clickstream.sources.source1.consumeOrder = youngest
clickstream.sources.source1.inputCharset = UTF-8
clickstream.sources.source1.decodeErrorPolicy = REPLACE
clickstream.sources.source1.deserializer = LINE
clickstream.sources.source1.deserializer.maxLineLength = 2048
#Whether to add a header storing the absolute path filename.
clickstream.sources.source1.fileHeader = false
#Suffix to append to completely ingested files
clickstream.sources.source1.fileSuffix = .COMPLETED
clickstream.channels.channel1.type = memory
clickstream.channels.channel1.capacity = 100
clickstream.channels.channel1.transactionCapacity = 100
##################################
### AVRO SINK: forwards events to the Spark Streaming receiver ###
##################################
clickstream.sinks.sink1.type = avro
clickstream.sinks.sink1.channel = channel1
clickstream.sinks.sink1.hostname = HOSTNAMEA
clickstream.sinks.sink1.port = 3333
The Spark Streaming Python code is as follows:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ClickstreamSocket")  # in the pyspark shell, sc already exists
ssc = StreamingContext(sc, 5)  # 5-second batch interval
lines = ssc.socketTextStream("HOSTNAMEA", 3333)
lines.pprint()
ssc.start()
ssc.awaitTermination()
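As an aside, socketTextStream is a plain TCP client: it expects a server listening on HOSTNAMEA:3333 that writes newline-delimited text, whereas a Flume avro sink is itself a client speaking the Avro RPC protocol, so the two cannot talk to each other directly. A minimal sketch (hypothetical helper using plain Python sockets, not part of my setup) of the kind of text server socketTextStream can connect to:

```python
import socket
import threading

def serve_lines(lines, host="127.0.0.1", port=0):
    """Start a one-shot TCP server that sends newline-terminated text
    to the first client and then closes; returns the bound port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))  # port=0 lets the OS pick a free port
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def run():
        conn, _addr = srv.accept()
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
        conn.close()
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return bound_port
```

Pointing socketTextStream at a server like this (instead of at the Flume avro sink's port) would at least confirm the Spark side of the pipeline works.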
The error I can see from Spark Streaming is:
16/09/27 07:51:30 INFO JobScheduler: Finished job streaming job 1474955490000 ms.0 from job set of time 1474955490000 ms
16/09/27 07:51:30 INFO BlockRDD: Removing RDD 100 from persistence list
16/09/27 07:51:30 INFO JobScheduler: Total delay: 0.011 s for time 1474955490000 ms (execution: 0.010 s)
16/09/27 07:51:30 INFO BlockManager: Removing RDD 100
16/09/27 07:51:30 INFO SocketInputDStream: Removing blocks of RDD BlockRDD[100] at socketTextStream at NativeMethodAccessorImpl.java:-2 of time 1474955490000 ms
16/09/27 07:51:30 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1474955480000 ms)
16/09/27 07:51:30 INFO InputInfoTracker: remove old batch metadata: 1474955480000 ms
16/09/27 07:51:31 INFO ReceiverTracker: Registered receiver for stream 0 from xxxxxxxxx:40483
16/09/27 07:51:31 ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to HOSTNAMEA:3333 - java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:73)
at org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:59)
16/09/27 07:51:33 INFO ReceiverTracker: Registered receiver for stream 0 from xxxxxxx:40483
16/09/27 07:51:33 ERROR ReceiverTracker: Deregistered receiver for stream 0: Restarting receiver with delay 2000ms: Error connecting to HOSTNAMEA:3333 - java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:73)
at org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:59)
In addition, I tried this code, but I get a different error:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils
ssc = StreamingContext(sc, 5)
flumeStream = FlumeUtils.createStream(ssc, "HOSTNAMEA", 3333)
flumeStream.pprint()
ssc.start()
ssc.awaitTermination()
With the second code I get a different error, using this jar:
spark-streaming-flume-assembly_2.11-1.6.0.jar
16/09/27 08:06:23 WARN TaskSetManager: Lost task 0.0 in stage 47.0 (TID 160, XXXXXXXXX): org.jboss.netty.channel.ChannelException: Failed to bind to: HOSTNAMEA/HOSTNAMEA:3333
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:119)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:74)
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:68)
at org.apache.spark.streaming.flume.FlumeReceiver.initServer(FlumeInputDStream.scala:162)
at org.apache.spark.streaming.flume.FlumeReceiver.onStart(FlumeInputDStream.scala:169)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.SparkContext$$anonfun$37.apply(SparkContext.scala:1992)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.BindException: Cannot assign requested address
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:437)
at sun.nio.ch.Net.bind(Net.java:429)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:372)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:296)
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
... 3 more
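For context, FlumeUtils.createStream is the push model: Spark itself starts an Avro NettyServer (the NettyServer.&lt;init&gt; frames in the trace) and the Flume avro sink connects to it as a client. The host/port passed to createStream therefore must be an address that the executor running the receiver can bind locally, which would explain the "Cannot assign requested address" when the receiver task is not scheduled on HOSTNAMEA. A hedged sketch of how the two sides would need to line up (SPARK_RECEIVER_HOST is a placeholder, not a name from my cluster):

```
# Flume avro sink: pushes to wherever the Spark receiver is listening
clickstream.sinks.sink1.type = avro
clickstream.sinks.sink1.channel = channel1
clickstream.sinks.sink1.hostname = SPARK_RECEIVER_HOST   # host where the receiver executor runs
clickstream.sinks.sink1.port = 3333
```

On the Spark side, the same SPARK_RECEIVER_HOST and port would then be passed to FlumeUtils.createStream, since that is where Spark binds its Avro server.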