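I am trying to use Flume to retrieve data from Twitter and store it in HDFS in JSON format. The data is being loaded into HDFS, but it is not in JSON format.
Here are a few lines from the HDFS file where the Twitter data is stored:
Objavro.schema\E4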
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}\00\E0D\C9H\B8$\DCb,C\8A5y\D1n\CE$733267766577356800\00\96\00Zumaran \00\C6C.A.B//C.A.H
Wsp:351 220-1251
Fb:Ramiro Pedernera✌
Insta:Ramiropedernera
Snapp:ramipedernera12\00\B2\9E\00\B2(\00(DIVI^Lista RAMIRO P.\00RamiPedernera12\00(2016-05-19T17:37:13Z\00tGaray culiadaso me metió una patada en la frente \00\00\00\00\00\00\A8<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>\00\E0D\C9H\B8$\DCb,C\8A5y\D1n
Objavro.schema\E4
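Because this is not JSON, I cannot process it by creating a table in Hive and loading this data. Please help me load the Twitter data into Hadoop HDFS in JSON format.
This is the command I used: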
bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
And here is the attached twitter.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =********
TwitterAgent.sources.Twitter.consumerSecret =*************
TwitterAgent.sources.Twitter.accessToken =****************
TwitterAgent.sources.Twitter.accessTokenSecret =*****************
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = json
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.handler = org.apache.flume.source.http.JSONHandler
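Answer 0: (score: 2)
To switch the output from Avro to JSON, you need to take a few steps.
In the configuration file, change the property
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
to
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
com.cloudera.flume.source.TwitterSource is a custom source class that emits each tweet as raw JSON, so the records are written to HDFS in JSON format.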
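The property change above can also be scripted. The following is a minimal sketch, assuming the agent configuration is a plain properties file such as the conf/twitter.conf from the question; the function name set_twitter_source_type is hypothetical and not part of Flume.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flume): rewrite the Twitter source type
# in a Flume agent configuration file, keeping a .bak copy of the original.
set_twitter_source_type() {
    conf_file="$1"   # e.g. conf/twitter.conf (path taken from the question)
    new_type="$2"    # e.g. com.cloudera.flume.source.TwitterSource
    # Replace the whole property line, whatever spacing surrounds '='.
    # Assumes new_type contains no '|' or '&' characters.
    sed -i.bak \
        "s|^TwitterAgent\.sources\.Twitter\.type[[:space:]]*=.*|TwitterAgent.sources.Twitter.type = ${new_type}|" \
        "$conf_file"
}
```

It could be called as: set_twitter_source_type conf/twitter.conf com.cloudera.flume.source.TwitterSource. The original file is kept next to it with a .bak suffix, so the change is easy to revert.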
To get this class, go to https://github.com/cloudera/cdh-twitter-example, download the flume-sources folder to your local machine, and build a JAR file from it.
Build the Flume source JAR:
$ cd flume-sources
$ mvn package
$ cd ..
This generates a file named flume-sources-1.0-SNAPSHOT.jar in the target directory.
Copy flume-sources-1.0-SNAPSHOT.jar
to /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
and also to /var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
If these directories do not exist, create them with:
sudo mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
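The directory creation and copy steps above can be combined into one small script. This is only a sketch: the function name install_flume_plugin_jar is hypothetical, and the plugins.d/<plugin>/lib layout is taken from the paths given in this answer.

```shell
#!/bin/sh
# Hypothetical helper (not part of Flume): copy a built JAR into the
# plugins.d/<plugin>/lib directory under each given Flume root directory.
install_flume_plugin_jar() {
    jar="$1"       # path to the built JAR
    plugin="$2"    # plugin directory name, e.g. twitter-streaming
    shift 2        # remaining arguments are the Flume root directories
    for root in "$@"; do
        dest="$root/plugins.d/$plugin/lib"
        mkdir -p "$dest" || return 1   # create the directory if it is missing
        cp "$jar" "$dest/" || return 1
    done
}
```

With the paths from this answer, the call would pass the path to flume-sources-1.0-SNAPSHOT.jar, then twitter-streaming, then /usr/lib/flume-ng and /var/lib/flume-ng. Writing under /usr/lib and /var/lib normally requires root privileges, so the script would have to run as root or through sudo.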
For more details, see Analyzing Twitter Data Using CDH.
Hope this helps!
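Answer 1: (score: 0)
Events from Flume's TwitterSource are in Avro format by default. To change that, you would have to modify the TwitterSource source code so that it emits tweets in their raw format (JSON). Fortunately, Cloudera has already done this at https://github.com/cloudera/cdh-twitter-example
All you need to do is follow the steps in the README to install the library containing the new TwitterSource, and change TwitterAgent.sources.Twitter.type in your Flume configuration file
to com.cloudera.flume.source.TwitterSource
The same project also includes an example configuration file.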
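To confirm which format a given output file is in, one option is to look at its first bytes: an Avro object container file starts with the four magic bytes 'O', 'b', 'j', 0x01, which matches the "Objavro.schema" prefix in the dump from the question. The sketch below checks a local copy of a file; the function name is_avro_container is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the file starts with the
# Avro object container magic bytes 'O' 'b' 'j' 0x01, fail otherwise.
is_avro_container() {
    # Dump the first four bytes as hex and strip od's spacing and newline.
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "4f626a01" ]
}
```

A file can be copied out of HDFS with hdfs dfs -get before running the check. A file written after switching to the Cloudera source should fail this check and start with a '{' character instead.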
Hope that helps.