Querying a Hive external table with Presto: Invalid UTF-8 start byte

Time: 2017-12-13 04:15:14

Tags: hive hdfs presto

I just installed Presto, and when I query Hive data with presto-cli I get the following error:

~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;

Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]

Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
 at [Source: java.io.ByteArrayInputStream@6eb5bdfd; line: 1, column: 376]

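The error appears only when I use an aggregate function (such as count or sum). The same query works in the Hive CLI, although it takes a long time because Hive converts the query into a MapReduce job.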

$ hive
WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87  Reduce: 1   Cumulative CPU: 1076.6 sec   HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)

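The point is that the same query works in Hive but not in Presto, and I cannot figure out why. I suspect it is because Hive and Presto use two different JSON libraries, but I am not sure. I created the external table in Hive with this query: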

hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';

Can anyone help me?

1 Answer:

Answer 0: (score: 0)

Posting here for easy reference:

From where the OP documented a solution:

  

I successfully solved the problem by using this SerDe: https://github.com/electrum/hive-serde (added to Presto at /usr/lib/presto/plugin/hive-hadoop2/ and to Hive at /usr/lib/hive-hcatalog/share/hcatalog/)
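
For anyone who wants to check whether the data itself contains bytes that are not valid UTF-8 (which is what the "Invalid UTF-8 start byte 0xa5" message suggests), here is a minimal Python sketch. It is only a diagnostic, not part of the OP's fix. The function name `find_invalid_utf8_lines` and the sample bytes are made up for illustration, and it assumes the JSON records are newline-delimited. The column it reports is a byte offset within the line and may not match the column Jackson prints.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, column, offending_byte) for each line that is not valid UTF-8.

    Hypothetical helper for illustration; line numbers and columns are 1-based.
    """
    problems = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the 0-based index of the first byte that failed to decode
            problems.append((line_number, err.start + 1, line[err.start]))
    return problems


# Made-up sample: the second record contains a stray 0xa5 byte.
sample = b'{"app": "ok"}\n{"app": "caf\xa5"}\n'

for line_number, column, byte in find_invalid_utf8_lines(sample):
    print(f"line {line_number}, column {column}: invalid byte 0x{byte:02x}")
```

To run this against real data, read one of the table's files into `data` as bytes, for example after copying it out of HDFS to local disk. If it reports nothing, the files are valid UTF-8 and the cause lies elsewhere.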