Streaming Twitter data from Flume to Spark for analysis

Time: 2018-09-19 13:45:56

Tags: apache-spark pyspark spark-streaming flume flume-twitter

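I am using the official Flume + Spark configuration as described in the documentation, but after the sink registers to the host and port number, Flume fails to send events. On the other side, Spark never receives the missed messages.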

Here is my configuration:

TwitterAgent1.sources = PublicStream2
TwitterAgent1.channels = fileCh2
TwitterAgent1.sinks = avrosink2
TwitterAgent1.sources.PublicStream2.type = com.cloudsigma.flume.twitter.TwitterSource
TwitterAgent1.sources.PublicStream2.channels = fileCh2
TwitterAgent1.sources.PublicStream2.consumerKey =
TwitterAgent1.sources.PublicStream2.consumerSecret =
TwitterAgent1.sources.PublicStream2.accessToken =
TwitterAgent1.sources.PublicStream2.accessTokenSecret =
TwitterAgent1.sources.PublicStream2.keywords = some keywrds
#TwitterAgent1.sources.PublicStream2.locations = -,-
TwitterAgent1.sources.PublicStream2.language = en
TwitterAgent1.sources.PublicStream2.follow =,
TwitterAgent1.sinks.avrosink2.type = avro 
TwitterAgent1.sinks.avrosink2.batch-size = 1 
# IP of the host (I am running on a cluster)
TwitterAgent1.sinks.avrosink2.hostname = 1x5.3x.3.1x2
TwitterAgent1.sinks.avrosink2.port = 9988 
TwitterAgent1.sinks.avrosink2.channel = fileCh2
TwitterAgent1.channels.fileCh2.type = file
TwitterAgent1.channels.fileCh2.capacity = 10000
TwitterAgent1.channels.fileCh2.transactionCapacity = 10000
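
One thing that may be worth checking: in the push-based approach the Avro sink has to point at the same host and port that the Spark receiver listens on. The configuration above uses port 9988, while the PySpark code below and the error log both mention port 41414, so the file shown may not be the one the agent actually runs with. The following is only a rough sketch of such a consistency check; the helper names `parse_properties`, `avro_sink_targets` and `mismatched_sinks`, and the host `spark-host.example`, are made up for illustration and are not part of Flume or Spark.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def avro_sink_targets(props, agent):
    """Return {sink_name: (hostname, port)} for every avro sink of `agent`."""
    targets = {}
    for sink in props.get(agent + ".sinks", "").split():
        prefix = "{}.sinks.{}.".format(agent, sink)
        if props.get(prefix + "type") == "avro":
            targets[sink] = (props.get(prefix + "hostname"),
                             int(props[prefix + "port"]))
    return targets


def mismatched_sinks(targets, receiver_host, receiver_port):
    """List the sinks whose target differs from the Spark receiver address."""
    expected = (receiver_host, receiver_port)
    return sorted(name for name, target in targets.items() if target != expected)


if __name__ == "__main__":
    # Hypothetical values, shaped like the configuration in the question.
    sample = """
    TwitterAgent1.sinks = avrosink2
    TwitterAgent1.sinks.avrosink2.type = avro
    TwitterAgent1.sinks.avrosink2.hostname = spark-host.example
    TwitterAgent1.sinks.avrosink2.port = 9988
    """
    targets = avro_sink_targets(parse_properties(sample), "TwitterAgent1")
    print(mismatched_sinks(targets, "spark-host.example", 41414))
```

With the sample values, the sink is reported as mismatched because it targets port 9988 while the receiver is assumed to listen on 41414.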

PySpark code:

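import warnings

import pyspark as ps
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

try:
    # Create the SparkContext (in my case I have 4 CPUs on my laptop)
    conf = SparkConf().setAppName("tweeterAnalysis")
    sc = ps.SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")

ssc = StreamingContext(sc, 10)  # 10-second batch interval
# Start an Avro receiver on this host and port; the Flume sink must target the same pair
flumeStream = FlumeUtils.createStream(ssc, "pa.pan.net", 41414)
# At least one output operation is required, otherwise ssc.start() refuses to run
flumeStream.pprint()

ssc.start()
ssc.awaitTermination()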
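
For the analysis part, each record of a stream created with `FlumeUtils.createStream` arrives in PySpark as a `(headers, body)` pair. The sketch below assumes that the body is a JSON-encoded tweet with a `text` field, which I have not verified for this particular Twitter source; `tweet_text` is a hypothetical helper name.

```python
import json


def tweet_text(event):
    """Return the tweet text from a (headers, body) pair, or None if unusable."""
    _headers, body = event
    try:
        tweet = json.loads(body)
    except (TypeError, ValueError):
        # Body is missing or is not valid JSON (assumption: tweets arrive as JSON).
        return None
    if not isinstance(tweet, dict):
        return None
    return tweet.get("text")


# Possible use on the stream, before ssc.start() (requires pyspark):
# flumeStream.map(tweet_text).filter(lambda t: t is not None).pprint()
```

Returning None instead of raising keeps a single malformed event from failing the whole batch; the bad records can then simply be filtered out.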

Error: Unable to deliver event. Exception follows.

org.apache.flume.EventDeliveryException: Failed to send events
    at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:389)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: pan0143.panoulu.net, port: 41414 }: Failed to send batch
    at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:314)
    at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:373)
    ... 3 more
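
A failure like this can occur when nothing is accepting connections on the target host and port, for example if the Spark receiver has not started yet or was scheduled on a different machine than the one the sink points to. A quick way to narrow this down from the Flume machine is a plain TCP probe. This is only a sketch using the Python standard library; `can_connect` is a hypothetical helper, and a successful connection does not prove that the listener is actually the Spark Avro receiver.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the error message above.
    print(can_connect("pan0143.panoulu.net", 41414))
```

If the probe fails while the Spark job is running, the receiver is probably not listening on that machine, so the address passed to `FlumeUtils.createStream` and the sink's `hostname` and `port` are the first things to compare.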

Can anyone help?

0 answers:

No answers