Using Flume to retrieve data from Twitter and store it in HDFS in JSON format

Time: 2016-05-19 13:23:08

Tags: json hadoop twitter hive flume

I am trying to use Flume to retrieve data from Twitter and store it in HDFS in JSON format. The data is being loaded into HDFS, but it is not in JSON format.

Below are a few lines from the HDFS file where the Twitter data is stored:

Objavro.schema\E4
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}\00\E0D\C9H\B8$\DCb,C\8A5y\D1n\CE$733267766577356800\00\96\00Zumaran \00\C6C.A.B//C.A.H
Wsp:351 220-1251
Fb:Ramiro Pedernera✌
Insta:Ramiropedernera
Snapp:ramipedernera12\00\B2\9E\00\B2(\00(DIVI^Lista RAMIRO P.\00RamiPedernera12\00(2016-05-19T17:37:13Z\00tGaray culiadaso me metió una patada en la frente \00\00\00\00\00\00\A8<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>\00\E0D\C9H\B8$\DCb,C\8A5y\D1n
Objavro.schema\E4

Because this is not JSON, I cannot process it by creating a table in Hive and loading the data into it. Please help me load the Twitter data into Hadoop HDFS in JSON format.

This is the command I used:

bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

And this is the attached twitter.conf:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =********
TwitterAgent.sources.Twitter.consumerSecret =*************
TwitterAgent.sources.Twitter.accessToken =****************
TwitterAgent.sources.Twitter.accessTokenSecret =*****************
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/hduser_/twitter-cool
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = json
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
TwitterAgent.sources.Twitter.handler = org.apache.flume.source.http.JSONHandler

2 Answers:

Answer 0 (score: 2)

To switch the output from Avro to JSON, follow these steps:

In the configuration file, change the property

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

to

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

com.cloudera.flume.source.TwitterSource is a custom source class that emits tweets as raw JSON, so the records land in HDFS in JSON format.
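The property change above can also be applied with a single substitution. This is only a minimal sketch, assuming a POSIX shell with sed available. It works on a throwaway file (a hypothetical stand-in for conf/twitter.conf that holds just the one line), so no real configuration is touched.

```shell
# Hypothetical stand-in for conf/twitter.conf, holding only the line of interest.
conf="$(mktemp)"
echo 'TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource' > "$conf"

# Swap the stock Flume source class for the Cloudera one; -i.bak keeps a backup copy.
sed -i.bak 's|org\.apache\.flume\.source\.twitter\.TwitterSource|com.cloudera.flume.source.TwitterSource|' "$conf"

# Show the resulting property line.
grep '^TwitterAgent.sources.Twitter.type' "$conf"
```

To use it on a real setup, point conf at the actual configuration file instead of the temporary one. The backup with the .bak suffix makes it easy to revert.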

To get this class, go to https://github.com/cloudera/cdh-twitter-example, download the flume-sources folder to your local machine, and build a JAR file from it.

  1. Build the Flume source JAR:

    $ cd flume-sources
    $ mvn package
    $ cd ..

     This generates a file named flume-sources-1.0-SNAPSHOT.jar in the target directory.

  2. Add the JAR to the Flume classpath: copy flume-sources-1.0-SNAPSHOT.jar to /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar and also to /var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar

      If these directories do not exist, create them with:

      sudo mkdir -p /usr/lib/flume-ng/plugins.d/twitter-streaming/lib/
      
      sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib/
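The directory creation and copy can be folded into one small helper. This is a sketch only: install_plugin_jar is a hypothetical function name, not part of Flume, and it assumes each root passed in uses the plugins.d/twitter-streaming/lib layout described above.

```shell
# Copy a JAR into <root>/plugins.d/twitter-streaming/lib for every root given.
# install_plugin_jar is a hypothetical helper name, not a Flume command.
install_plugin_jar() {
  jar="$1"
  shift
  for root in "$@"; do
    dest="$root/plugins.d/twitter-streaming/lib"
    # Create the plugin directory if it is missing, then copy the JAR into it.
    mkdir -p "$dest" || return 1
    cp "$jar" "$dest/" || return 1
  done
}

# Example call (likely needs root privileges for these system paths):
#   install_plugin_jar flume-sources/target/flume-sources-1.0-SNAPSHOT.jar \
#     /usr/lib/flume-ng /var/lib/flume-ng
```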
      

For more details, see "Analyzing Twitter Data Using CDH".

Hope this helps!

Answer 1 (score: 0)

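Events from Flume's built-in TwitterSource are in Avro format by default. To change that, you would have to modify the TwitterSource source code so that it emits tweets in their raw format (JSON). Fortunately, Cloudera has already done this at https://github.com/cloudera/cdh-twitter-example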

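All you need to do is follow the steps in the README to install the library containing the new TwitterSource, and change TwitterAgent.sources.Twitter.type in your Flume configuration file to com.cloudera.flume.source.TwitterSource. The same project includes an example configuration file.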
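To check which format actually ended up in HDFS after switching sources, one option is to look at the first bytes of a file. Avro object container files start with the four bytes 'O', 'b', 'j', 0x01, which matches the Objavro.schema prefix in the dump from the question. The sketch below assumes head, od and tr are available; is_avro_container is a hypothetical helper name.

```shell
# Succeeds if stdin starts with the Avro object container magic: 'O' 'b' 'j' 0x01.
# is_avro_container is a hypothetical helper name.
is_avro_container() {
  # Read the first four bytes and render them as a hex string without spaces.
  magic="$(head -c 4 | od -An -tx1 | tr -d ' \n')"
  [ "$magic" = "4f626a01" ]
}

# Example (assumes the hdfs CLI is on PATH; <file> is a placeholder):
#   hdfs dfs -cat /user/hduser_/twitter-cool/<file> | is_avro_container && echo "still Avro"
```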

Hope that helps.