File format error

Asked: 2018-10-28 05:55:13

Tags: sql hadoop hive cloudera flume-ng

I am working in Cloudera and have just started learning it, so I have been trying to implement the well-known Twitter example using Flume. After some effort I am able to stream data from Twitter and save it to files. Now that I have the data, I want to analyze it, but I cannot get the Twitter data into a table. I have successfully created the `tweets` table, but I cannot load the data into it. Below I provide the Twitter.conf file, the external-table creation query, the data-load query, the error message, and part of the data I obtained. Please point out where I went wrong. Note that I have been writing the queries in the Hive editor.

The Twitter.conf file

# Naming the components on the current agent. 
TwitterAgent.sources = Twitter 
TwitterAgent.channels = MemChannel 
TwitterAgent.sinks = HDFS

# Describing/Configuring the source 
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = 95y0IPClnNPUTJ1AHSfvBLWes
TwitterAgent.sources.Twitter.consumerSecret = UmlNcFwiBIQIvuHF9J3M3xUv6UmJlQI3RZWT8ybF2KaKcDcAw5
TwitterAgent.sources.Twitter.accessToken = 994845066882699264-Yk0DNFQ4VJec9AaCQ7QTBlHldK5BSK1 
TwitterAgent.sources.Twitter.accessTokenSecret =  q1Am5G3QW4Ic7VBx6qJg0Iv7QXfk0rlDSrJi1qDjmY3mW
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing



# Describing/Configuring the channel 
TwitterAgent.channels.MemChannel.type = memory 
TwitterAgent.channels.MemChannel.capacity = 10000 
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel 
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel 

# Describing/Configuring the sink 

TwitterAgent.sinks.HDFS.type = hdfs 
TwitterAgent.sinks.HDFS.hdfs.path = /user/cloudera/latestdata/
TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text 
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0 
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000 

External table creation query and load-data query

CREATE EXTERNAL TABLE tweets (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
     text:STRING,
     user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
     urls:ARRAY<STRUCT<expanded_url:STRING>>,
     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
     hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
     screen_name:STRING,
     name:STRING,
     friends_count:INT,
     followers_count:INT,
     statuses_count:INT,
     verified:BOOLEAN,
     utc_offset:INT,
     time_zone:STRING>,
   in_reply_to_screen_name STRING
 ) 
 PARTITIONED BY (datehour INT)
 ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
 LOCATION '/user/cloudera/tweets';




LOAD DATA INPATH '/user/cloudera/latestdata/FlumeData.1540555155464'
INTO TABLE `default.tweets`
PARTITION (datehour='2013022516')
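Since LOAD DATA only moves the file into the table's location without converting it, the SerDe ends up reading the raw bytes Flume wrote. As a quick sanity check (my own sketch, not part of the original question; the helper name and paths are hypothetical), you can inspect the first bytes of a downloaded FlumeData file to see whether it is a Hadoop SequenceFile rather than the plain JSON the JsonSerDe expects:

```python
# Sketch: detect a Hadoop SequenceFile by its magic header.
# Every SequenceFile starts with the bytes 'SEQ' plus a version byte;
# newline-delimited JSON for the JsonSerDe would start with '{' instead.
def is_sequencefile(path):
    with open(path, "rb") as f:
        header = f.read(4)
    return header[:3] == b"SEQ"

# Example with a fake file standing in for a FlumeData file:
with open("/tmp/flume_sample", "wb") as f:
    f.write(b"SEQ\x06org.apache.hadoop.io.LongWritable...")

print(is_sequencefile("/tmp/flume_sample"))  # → True
```

If this prints True for your FlumeData files, the "Wrong file format" error from MoveTask is consistent with the data on disk.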

Error when I try to load the data into the table


Error while processing statement: FAILED: Execution Error, return code 20013 from org.apache.hadoop.hive.ql.exec.MoveTask. Wrong file format. Please check the file format.

The Twitter data file I obtained

SEQ.org.apache.hadoop.io.LongWritable.org.apache.hadoop.io.Text...Objavro.schema.{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]} ... 1055790978844540929 ... gracieowehimnothng (2018-10-26T04:59:19Z) ... [remaining tweet text and binary record data garbled]

It has been a week and I cannot work out a solution. Please let me know if more information is needed and I will provide it here.

1 answer:

Answer 0 (score: 0)

Flume is not writing JSON, so the JsonSerDe is not what you want.
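For contrast, here is a small sketch (the field values are illustrative, not from the question's data) of what the JsonSerDe does expect: a text file with exactly one JSON object per line, whose keys match the table's column names:

```python
import json

# JsonSerDe-style input: newline-delimited JSON, one record per line.
lines = [
    '{"id": 1, "text": "first tweet", "retweet_count": 0}',
    '{"id": 2, "text": "second tweet", "retweet_count": 5}',
]
records = [json.loads(line) for line in lines]

# A query like "tweets that were retweeted" maps onto these records directly.
retweeted_ids = [r["id"] for r in records if r["retweet_count"] > 0]
print(retweeted_ids)  # → [2]
```

A SequenceFile with an embedded Avro schema, as shown in the question, is not parseable this way, which is why the load fails the format check.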

You need to adjust these lines:

TwitterAgent.sinks.flumeHDFS.hdfs.fileType = DataStream 
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text

Flume is currently writing a SequenceFile containing Avro:

SEQ!org.apache.hadoop.io.LongWritable org.apache.hadoop.io.Text ... Objavro.schema ...

Hive can read Avro as-is, so it is not clear why you are using the JsonSerDe.
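Concretely, a sketch of the adjusted sink section (untested against this setup): note that the first quoted line names the sink `flumeHDFS`, which does not match the sink declared as `TwitterAgent.sinks = HDFS`, so that `fileType` property was most likely being ignored. The component name must be consistent throughout:

```
# Sketch of a corrected sink section: use the declared sink name (HDFS)
# on every property, and write plain events instead of a SequenceFile.
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /user/cloudera/latestdata/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
```

Alternatively, since the schema is already embedded in the output and Hive ships with an AvroSerDe, the table could be declared over the Avro data instead of using the JsonSerDe.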