I am working in Cloudera and have just started learning it, so I have been trying to implement the well-known Twitter example with Flume. After some effort I am now able to stream data from Twitter and save it to files. Having got the data, I want to run some analysis on it, but the problem is that I cannot get the Twitter data into a table. I have successfully created the "tweets" table, but loading data into it fails. Below I have included my Twitter.conf file, the external table creation query, the data load query, the error message, and part of the data I collected. Please point out where I am going wrong. Note that I have been writing the queries in the Hive editor.
The Twitter.conf file:
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = 95y0IPClnNPUTJ1AHSfvBLWes
TwitterAgent.sources.Twitter.consumerSecret = UmlNcFwiBIQIvuHF9J3M3xUv6UmJlQI3RZWT8ybF2KaKcDcAw5
TwitterAgent.sources.Twitter.accessToken = 994845066882699264-Yk0DNFQ4VJec9AaCQ7QTBlHldK5BSK1
TwitterAgent.sources.Twitter.accessTokenSecret = q1Am5G3QW4Ic7VBx6qJg0Iv7QXfk0rlDSrJi1qDjmY3mW
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/cloudera/latestdata/
TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
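For reference, I start the agent roughly like this (a sketch only; the conf directory is illustrative and may differ per install):
# Launch the TwitterAgent defined in Twitter.conf.
flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file Twitter.conf \
  --name TwitterAgent \
  -Dflume.root.logger=INFO,console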
The external table creation query and the load-data query:
CREATE External TABLE tweets (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/tweets';
LOAD DATA INPATH '/user/cloudera/latestdata/FlumeData.1540555155464'
INTO TABLE `default.tweets`
PARTITION (datehour='2013022516')
The error when I try to load data into the table:
Error while processing statement: FAILED: Execution Error, return code 20013 from org.apache.hadoop.hive.ql.exec.MoveTask. Wrong file format. Please check the file's format.
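One way to confirm what Hive is rejecting is to dump the first bytes of the file (a hedged sketch, reusing the path from the load query):
# Peek at the file header to identify the on-disk format;
# a leading "SEQ" magic marks a Hadoop SequenceFile.
hdfs dfs -cat /user/cloudera/latestdata/FlumeData.1540555155464 | head -c 300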
Part of the Twitter data file I got:
SEQ�org.apache.hadoop.io.LongWritable�org.apache.hadoop.io.Text���R�LX�}H�f�>(�H�Objavro.schema�{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}�yږ���w����M߀J��&1055790978844540929��gracieowehimnothng(2018-10-26T04:59:19Z�GIRLS we gave it back
It has been a week and I have not been able to figure out a solution. Please let me know if more information is needed and I will provide it here.
Answer 0 (score: 0):
Flume is not writing JSON, so the JsonSerDe is not what you want.
You need to adjust these lines:
TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
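Note that the first of those lines configures a sink named flumeHDFS, but the agent's sink is actually named HDFS, so the fileType setting is never applied and the sink falls back to its default, SequenceFile. A corrected sketch, keeping the names from the question:
# The property key must use the real sink name (HDFS, not flumeHDFS);
# otherwise hdfs.fileType silently stays at its SequenceFile default.
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text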
Flume is currently writing a SequenceFile that contains Avro:
SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.Text� �����R�LX� }H�f�>(�H�Objavro.schema�
Hive can read Avro as-is, so it is not clear why you are using the JsonSerDe in the first place.
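If you go the Avro route, here is a minimal sketch of an Avro-backed external table. The table name and column list are assumptions on my part, mirroring the schema embedded in the file dump above, and this presumes the files come out as plain Avro once the sink actually writes DataStream:
-- Hypothetical table; columns follow the Avro schema in the FlumeData header.
CREATE EXTERNAL TABLE tweets_avro (
  id STRING,
  user_friends_count INT,
  user_location STRING,
  user_description STRING,
  user_statuses_count INT,
  user_followers_count INT,
  user_name STRING,
  user_screen_name STRING,
  created_at STRING,
  text STRING,
  retweet_count BIGINT,
  retweeted BOOLEAN,
  in_reply_to_user_id BIGINT,
  source STRING,
  in_reply_to_status_id BIGINT,
  media_url_https STRING,
  expanded_url STRING
)
STORED AS AVRO
LOCATION '/user/cloudera/latestdata/';
A quick SELECT user_screen_name, text FROM tweets_avro LIMIT 10; would then confirm that Hive can read the records.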