Question

I just installed Presto, and when I query Hive data with presto-cli I get the following error:

~$ presto --catalog hive --schema default
presto:default> select count(*) from test3;

Query 20171213_035723_00007_3ktan, FAILED, 1 node
Splits: 131 total, 14 done (10.69%)
0:18 [1.04M rows, 448MB] [59.5K rows/s, 25.5MB/s]

Query 20171213_035723_00007_3ktan failed: com.facebook.presto.hive.$internal.org.codehaus.jackson.JsonParseException: Invalid UTF-8 start byte 0xa5
 at [Source: java.io.ByteArrayInputStream@6eb5bdfd; line: 1, column: 376]

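The error occurs only when I use an aggregate function (such as count or sum). The same query works in the Hive CLI, although it takes a long time because Hive converts the query into a MapReduce job.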

$ hive
WARNING: Use "yarn jar" to launch YARN applications.

Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select count(*) from test3;
...
MapReduce Total cumulative CPU time: 17 minutes 56 seconds 600 msec
Ended Job = job_1511341039258_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 87  Reduce: 1   Cumulative CPU: 1076.6 sec   HDFS Read: 23364693216 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 17 minutes 56 seconds 600 msec
OK
51751422
Time taken: 269.143 seconds, Fetched: 1 row(s)

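The point is that the same query works on Hive but not on Presto, and I can't figure out why. I suspect it is because Hive and Presto use two different JSON libraries, but I'm not sure. I created the external table on Hive with this query: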

hive> create external table test2 (app string, contactRefId string, createdAt struct <`date`: string, timezone: string, timezone_type: bigint>, eventName string, eventTime bigint, shopId bigint) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE LOCATION '/data/data-new/2017/11/29/';

Can anyone help me?

Answer 1

Posting here for easy reference:

From where the OP documented a solution:

I managed to solve the problem by using this SerDe: https://github.com/electrum/hive-serde (added to Presto at /usr/lib/presto/plugin/hive-hadoop2/ and to /usr/lib/hive-hcatalog/share/hcatalog/)
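If you want to confirm that the input files really contain bytes that are not valid UTF-8 (which is what the Jackson error message reports), something like the following may help. This is only a minimal sketch, not part of the OP's fix: it assumes the table data is newline-delimited JSON and that a data file has been copied from HDFS to local disk first. The names `find_invalid_utf8` and `scan_file` are made up for illustration.

```python
def find_invalid_utf8(lines):
    """Yield (line_number, byte_offset, offending_byte) for every line
    that cannot be decoded as UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    line_number is 1-based; byte_offset is 0-based within the line.
    """
    for line_number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the index of the first byte that failed to decode
            yield line_number, err.start, raw[err.start]


def scan_file(path):
    """Return all invalid-UTF-8 locations in a local file (hypothetical helper)."""
    with open(path, "rb") as handle:
        return list(find_invalid_utf8(handle))
```

Calling `scan_file` on a local copy of one of the files under the table's LOCATION returns a list of locations; an empty list would suggest the bytes themselves are fine and the problem lies elsewhere. Only the first bad byte per line is reported.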
