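I have spent a few weeks trying to stream Twitter data with Flume (Flume 1.5.0-cdh5.4.3) into Hadoop (Hadoop 2.6.0-cdh5.4.3) on a local virtual machine running CentOS 6.6.
Initially I tried the built-in Twitter source that ships with Flume as a default library. It ran, but the data was apparently not encoded correctly, which the Cloudera team later confirmed as a known issue (https://community.cloudera.com/t5/Data-Ingestion-Integration/Flume-Twitter-data-looks-corrupt/td-p/48095).
That approach uses the class name org.apache.flume.source.twitter.TwitterSource, set in the flume.conf file.
Later I tried the custom Twitter source JAR that can be found on Cloudera's website (http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar) and in many other tutorials, but this led to a different problem: the application gets stuck while receiving the status stream from the Twitter API.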
2017-05-25 09:51:37,875 (Twitter Stream consumer-1[initializing])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Establishing connection.
2017-05-25 09:52:10,545 (Twitter Stream consumer-1[Establishing connection])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Connection established.
2017-05-25 09:52:10,546 (Twitter Stream consumer-1[Establishing connection])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Receiving status stream.
This approach uses the class name com.cloudera.flume.source.TwitterSource, which is reportedly no longer valid and has been replaced by org.apache.flume.source.twitter.TwitterSource. The relevant files are set up as follows:
1) flume.conf
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Old class (Cloudera custom source)
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
# New class (source bundled with Apache Flume)
#TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.hdfs.callTimeout = 30000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
2) flume-env.sh
JAVA_HOME=/usr/java/jdk1.7.0_67
JAVA_OPTS="-Xmx500m"
FLUME_CLASSPATH="/usr/lib/flume-ng/lib/flume-twitter-source.jar"
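To rule out a simple wiring mistake in flume.conf, a small check like the one below can help. It is only a sketch: `parse_properties` and `check_wiring` are hypothetical helpers, not part of Flume, and the parser handles only plain `key = value` lines rather than the full Java properties syntax. It confirms that every source and sink is bound to a declared channel, and flags values that contain a `#`, since the Java properties format only treats `#` as a comment at the start of a line.

```python
# Minimal sanity check for a Flume agent configuration (illustrative only).
# Assumption: the file uses simple "key = value" lines, as in the listing above.

SAMPLE = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.channels.MemChannel.type = memory
"""


def parse_properties(text):
    """Parse simple key = value lines; whole-line '#' comments are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of human-readable problems found in the agent's wiring."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())

    # Each source must be bound to at least one declared channel.
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source} has no channels")
        for channel in bound:
            if channel not in channels:
                problems.append(f"source {source} uses undeclared channel {channel}")

    # Each sink must be bound to exactly one declared channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel", "")
        if channel not in channels:
            problems.append(f"sink {sink} uses undeclared channel '{channel}'")

    # A '#' inside a value is kept as part of the value, not read as a comment.
    for key, value in props.items():
        if "#" in value:
            problems.append(f"{key} has '#' inside its value")

    return problems


if __name__ == "__main__":
    for problem in check_wiring(parse_properties(SAMPLE), "TwitterAgent"):
        print(problem)
```

Pointing the same functions at the real file (for example via `open("flume.conf").read()`, path assumed) checks the actual configuration instead of the embedded sample.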
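So, any ideas about what is going on? Has anyone been able to run Flume with this setup? Is there a newer way of connecting Flume to the Twitter API that I'm not aware of? Thanks!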