I hit a small problem while trying Cloudera 5.4.2. It is based on this article:
Apache Flume - Fetching Twitter Data http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm
It uses Flume and the Twitter stream to fetch tweets for data analysis. Everything went smoothly: creating the Twitter application, creating the directory on HDFS, configuring Flume, starting to fetch data, and creating a schema on top of the tweets.
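For reference, the flume.conf from the article looks roughly like the sketch below; the keys, keywords, and HDFS path are placeholders rather than my real values:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, bigdata
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100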
Then here is the problem. The Twitter streaming source converts tweets to Avro format and sends the Avro events to the downstream HDFS sink. When the Avro-backed Hive table loads the data, I get the error message "Avro block size invalid or too large".
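For context, the Avro-backed table was declared roughly like this (table name, schema URL, and location are placeholders, not my exact DDL):

CREATE EXTERNAL TABLE tweets
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/tweets'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/flume/tweets.avsc');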
So what is an Avro block, and what is the limit on block size? Can I change it? Judging from this message, what does it mean? Is the file at fault, or are some of the records at fault? If the Twitter streaming source hit bad data, it should have crashed right there. And if converting the tweets to Avro format went fine, then the Avro data should read back correctly, right?
I also tried avro-tools-1.7.7.jar:
java -jar avro-tools-1.7.7.jar tojson FlumeData.1458090051232
{"id":"710300089206611968","user_friends_count":{"int":1527},"user_location":{"string":"1633"},"user_description":{"string":"Steady Building an Empire..... #UGA"},"user_statuses_count":{"int":44471},"user_followers_count":{"int":2170},"user_name":{"string":"Esquire Shakur"},"user_screen_name":{"string":"Esquire_Bowtie"},"created_at":{"string":"2016-03-16T23:01:52Z"},"text":{"string":"RT @ugaunion: .@ugasga is hosting a debate between the three SGA executive tickets. Learn more about their plans to serve you https://t.co/…"},"retweet_count":{"long":0},"retweeted":{"boolean":true},"in_reply_to_user_id":{"long":-1},"source":{"string":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"},"in_reply_to_status_id":{"long":-1},"media_url_https":null,"expanded_url":null}
{"id":"710300089198088196","user_friends_count":{"int":100},"user_location":{"string":"DM開放してます(`・ω・´)"},"user_description":{"string":"Exception in thread "main" org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:77)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Caused by: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:266)
... 4 more
Same problem. I have googled a lot and found no answer.
If you have run into this problem, could anyone give me a solution? Or could anyone offer a clue, if you fully understand the Avro internals or the streaming underneath Twitter?
It really is a problem. Please give it some thought.
Answer 0 (score: 0)
Use the Cloudera TwitterSource, otherwise you will run into this problem:
Unable to correctly load twitter avro data into hive table
In the article, the Apache TwitterSource is used:
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
Twitter 1% Firehose Source
This source is highly experimental. It connects to the 1% sample Twitter Firehose using streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
But it should be the Cloudera TwitterSource instead:
https://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
And do not just download the pre-built jar, since our Cloudera version is 5.4.2; otherwise you will get this error:
Cannot run Flume because of JAR conflict
You should compile it with Maven yourself:
https://github.com/cloudera/cdh-twitter-example
Download and compile flume-sources-1.0-SNAPSHOT.jar. This jar contains the implementation of the Cloudera TwitterSource.
Steps:
wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
sudo yum install apache-maven
mvn package
Then put the resulting jar into the Flume plugins directory:
/var/lib/flume-ng/plugins.d/twitter-streaming/lib/flume-sources-1.0-SNAPSHOT.jar
Note: yum update to the latest versions first, otherwise the compilation (mvn package) fails because of a security issue.
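Put together, with the unzip and copy steps spelled out, the sequence looks roughly like this (the unpacked directory name and the flume-sources module path are assumptions based on the current layout of the cdh-twitter-example repo, so check them against the archive you download):

wget https://github.com/cloudera/cdh-twitter-example/archive/master.zip
unzip master.zip
cd cdh-twitter-example-master/flume-sources
sudo yum install apache-maven
mvn package
sudo mkdir -p /var/lib/flume-ng/plugins.d/twitter-streaming/lib
sudo cp target/flume-sources-1.0-SNAPSHOT.jar /var/lib/flume-ng/plugins.d/twitter-streaming/lib/

Then restart the Flume agent so it picks up the new plugin.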