Hive crashes on where clause

Time: 2016-10-14 19:01:58

Tags: mongodb hadoop hive emr

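I'm trying to get a hive-hadoop-mongo setup working. I have imported data from a JSON file into mongodb, and then I created both internal and external tables in hive that are connected to mongo: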

CREATE EXTERNAL TABLE reviews(
    user_id STRING, 
    review_id STRING, 
    stars INT, 
    date1 STRING,
    text STRING,
    type STRING,
    business_id STRING
     )
    STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
    WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')
    TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/test.reviews');

This part works fine, since a select-all query (select * from reviews) outputs everything it should. But when I use a where clause (for example select * from reviews where stars=4), hive crashes.

When I start hive, I add the following jars:

add jar mongo-hadoop.jar;
add jar mongo-java-driver-3.3.0.jar;
add jar mongo-hadoop-hive-2.0.1.jar;

In case it is relevant, I'm using an Amazon EMR cluster, which I connect to via ssh.

Thanks for any help.

Here is the error hive throws:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression(Ljava/lang/String;)Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;
    at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getFilter(HiveMongoInputFormat.java:134)
    at com.mongodb.hadoop.hive.input.HiveMongoInputFormat.getRecordReader(HiveMongoInputFormat.java:103)
    at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:691)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:329)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:455)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:424)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:144)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1885)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

3 Answers:

Answer 0: (score: 0)

Create the table as shown below and check.

CREATE [EXTERNAL] TABLE <tablename>
(<schema>)
ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe'
[WITH SERDEPROPERTIES('mongo.columns.mapping'='<JSON mapping>')]
STORED AS INPUTFORMAT 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
OUTPUTFORMAT 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
[LOCATION '<path to existing directory>'];

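Instead of using a StorageHandler to read, serialize, deserialize and output the data from Hive objects to BSON objects, the individual components are listed separately. This is because using a StorageHandler has too many negative effects when dealing with the native HDFS filesystem.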

Answer 1: (score: 0)

I see

WITH SERDEPROPERTIES('mongo.columns.mapping'='{"date1":"date"}')

and you are querying the column stars, which is not mapped.

Answer 2: (score: 0)

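I ran into this problem on my cluster.

The cluster's hive version is higher than the version in mongo-hive (1.2.1).

The old org.apache.hadoop.hive.ql.exec.Utilities.deserializeExpression has been renamed to org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpression

You need to rebuild the jar yourself.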
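
Before rebuilding, a small reflection probe along the following lines might help confirm which of the two classes your cluster's hive actually provides. This is only a sketch: the class name DeserializeExpressionProbe and the helper findOwner are made up for illustration, the two hive class names are the ones mentioned above, and it only tells you something useful when run with the cluster's hive jars on the classpath.

```java
import java.lang.reflect.Method;

public class DeserializeExpressionProbe {

    // Class names taken from the answer above; the newer location is tried first.
    static final String[] HIVE_CANDIDATES = {
        "org.apache.hadoop.hive.ql.exec.SerializationUtilities",
        "org.apache.hadoop.hive.ql.exec.Utilities"
    };

    /**
     * Returns the name of the first class in classNames that has a public
     * method called methodName taking a single String argument, or null if
     * none of the classes is on the classpath or none has such a method.
     */
    static String findOwner(String[] classNames, String methodName) {
        ClassLoader loader = DeserializeExpressionProbe.class.getClassLoader();
        for (String name : classNames) {
            try {
                // initialize=false: only look the class up, do not run its static initializers
                Class<?> cls = Class.forName(name, false, loader);
                Method m = cls.getMethod(methodName, String.class);
                return m.getDeclaringClass().getName();
            } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
                // Class is missing, or lacks the method: try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String owner = findOwner(HIVE_CANDIDATES, "deserializeExpression");
        if (owner == null) {
            System.out.println("deserializeExpression(String) not found; are the hive jars on the classpath?");
        } else {
            System.out.println("deserializeExpression(String) found in " + owner);
        }
    }
}
```

If it reports SerializationUtilities, that would be consistent with the NoSuchMethodError in the question: the stack trace shows the prebuilt mongo-hadoop-hive jar calling Utilities.deserializeExpression, so the jar would have to be rebuilt against the cluster's hive version.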