flume - Streaming data from the Twitter API to HDFS

Time: 2017-05-25 17:34:59

Tags: apache hadoop twitter cloudera flume

I have spent a few weeks trying to stream Twitter data with Flume (Flume 1.5.0-cdh5.4.3) to a Hadoop (Hadoop 2.6.0-cdh5.4.3) server in a local virtual machine running CentOS 6.6.

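At first I tried the built-in Twitter source that ships with Flume as a default library. It ran, but the data was apparently not encoded correctly, which the Cloudera team later confirmed as a known issue (https://community.cloudera.com/t5/Data-Ingestion-Integration/Flume-Twitter-data-looks-corrupt/td-p/48095).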

That approach uses the class name org.apache.flume.source.twitter.TwitterSource, set correctly in the flume.conf file.

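After that, I tried the custom Twitter source JAR that can be found on Cloudera's website (http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar) and in many tutorials, but this led to another problem: while consuming the Twitter streaming API, the application gets stuck right after logging "Receiving status stream."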

2017-05-25 09:51:37,875 (Twitter Stream consumer-1[initializing]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Establishing connection.
2017-05-25 09:52:10,545 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Connection established.
2017-05-25 09:52:10,546 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] Receiving status stream.

This approach uses the class name com.cloudera.flume.source.TwitterSource, which is reportedly no longer valid and has been replaced by org.apache.flume.source.twitter.TwitterSource. The required files are set up as follows:

1) flume.conf

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Old class:
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
# New class:
#TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.hdfs.callTimeout = 30000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
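
For anyone wanting to sanity-check a file like this before starting the agent, here is a minimal sketch in Python. It assumes the file follows Java properties semantics, where '#' starts a comment only at the beginning of a line, so a trailing "#(...)" annotation ends up as part of the value. The helper names parse_properties and check_agent, and the trimmed SAMPLE text, are made up for illustration and are not part of Flume.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines.

    Lines starting with '#' or '!' are comments; a '#' later in the line
    is kept as part of the value (assumed Java properties behavior).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return a list of human-readable problems in one agent's wiring."""
    problems = []
    channels = set(props.get(agent + ".channels", "").split())

    # Every source must list at least one declared channel.
    for name in props.get(agent + ".sources", "").split():
        used = props.get("%s.sources.%s.channels" % (agent, name), "").split()
        if not used:
            problems.append("source %s has no channels" % name)
        for ch in used:
            if ch not in channels:
                problems.append("source %s uses undeclared channel %s" % (name, ch))

    # Every sink must point at exactly one declared channel.
    for name in props.get(agent + ".sinks", "").split():
        ch = props.get("%s.sinks.%s.channel" % (agent, name))
        if ch not in channels:
            problems.append("sink %s uses undeclared channel %s" % (name, ch))

    # Flag values that carry what looks like an inline comment.
    for key, value in sorted(props.items()):
        if key.startswith(agent + ".") and "#" in value:
            problems.append("%s: '#' is part of the value, not a comment" % key)
    return problems


# Trimmed, illustrative sample with an inline annotation on the type line.
SAMPLE = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource #(OLD CLASS)
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
"""

if __name__ == "__main__":
    for problem in check_agent(parse_properties(SAMPLE), "TwitterAgent"):
        print(problem)
```

To check a real file, its contents could be read with open(path).read() and passed to parse_properties in place of SAMPLE.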

2) flume-env.sh

JAVA_HOME=/usr/java/jdk1.7.0_67
JAVA_OPTS="-Xmx500m"
FLUME_CLASSPATH="/usr/lib/flume-ng/lib/flume-twitter-source.jar"
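
Since two different TwitterSource classes are involved, it may help to see which JARs on the classpath actually contain each of them. This is a minimal sketch, assuming Python is available on the VM and that the relevant JARs sit in /usr/lib/flume-ng/lib (the directory used in FLUME_CLASSPATH above). find_class_in_jars is a hypothetical helper, not something that ships with Flume.

```python
import glob
import os
import zipfile


def find_class_in_jars(class_name, jar_dir):
    """Return the sorted paths of JARs in jar_dir that contain class_name."""
    # A class a.b.C is stored in a JAR as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except (zipfile.BadZipFile, OSError):
            # Skip broken symlinks and files that are not valid archives.
            continue
    return hits


if __name__ == "__main__":
    for cls in ("com.cloudera.flume.source.TwitterSource",
                "org.apache.flume.source.twitter.TwitterSource"):
        print(cls, "->", find_class_in_jars(cls, "/usr/lib/flume-ng/lib"))
```

If a class shows up in more than one JAR, or in none, that would at least narrow down which source implementation the agent is really loading.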

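So, any ideas about what is going on? Has anyone been able to run Flume with this setup? Is there a new way of connecting Flume to the Twitter API that I am not aware of? Thanks!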

0 answers:

No answers