I am trying to query a Hive table using the HCatalog JSON SerDe (from hcatalog-core-0.5.0-cdh4.7.0.jar). I am running on CDH4 (Hadoop 2.0.0-cdh4.7.0 and Hive 0.10.0-cdh4.7.0).
Table definition:
CREATE EXTERNAL TABLE some_table(
user_id int COMMENT 'from deserializer',
event_time int COMMENT 'from deserializer',
some_string string COMMENT 'from deserializer',
some_id string COMMENT 'from deserializer',
another_id int COMMENT 'from deserializer')
PARTITIONED BY (
year int,
month int,
day int)
ROW FORMAT SERDE
'org.apache.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:8020/somedir/some_table'
TBLPROPERTIES (
'last_modified_by'='volker',
'last_modified_time'='1424980336',
'transient_lastDdlTime'='1424980952')
Creating the partition:
alter table some_table add if not exists partition (year=2015,month=02,day=26) location '/somedir/some_table/year=2015/month=02/day=26'
At first everything looks fine, and I can read the data when selecting all columns:
hive> select * from some_table limit 10;
OK
671764813 1424980760 fbx NtiwgY 6 2015 02 26
1632511524 1424980760 fbx AdMybO 10 2015 02 26
1201817175 1424980760 fbx GgQJEd 6 2015 02 26
1621940110 1424980760 fbx qmsXNQ 12 2015 02 26
326380277 1424980760 fbx zgVFgP 2 2015 02 26
1256744282 1424980760 fbx GeIFxq 6 2015 02 26
1741961976 1424980760 fbx CiuxZU 8 2015 02 26
2009923690 1424980760 fbx ZmGOvK 2 2015 02 26
1728798342 1424980760 fbx YikDcV 8 2015 02 26
688185292 1424980760 fbx NssSWN 7 2015 02 26
However, as soon as I try to read or reference a specific field anywhere in the query, it fails:
hive> select another_id from some_table limit 10;
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:91)
at org.apache.hadoop.fs.Path.<init>(Path.java:99)
at org.apache.hadoop.fs.Path.<init>(Path.java:58)
at org.apache.hadoop.mapred.JobClient.copyRemoteFiles(JobClient.java:745)
at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:849)
at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:774)
at org.apache.hadoop.mapred.JobClient.access$400(JobClient.java:178)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:991)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
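The same thing happens when I use a field in the where condition.
I can use the partition fields in the where clause, so select * from some_table where year=2015
works fine, whereas select year from some_table limit 10
fails with the error above.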
The files in HDFS look like this:
{"another_id":6,"user_id":671764813,"some_id":"NtiwgY","event_time":1424980760,"some_string":"fbx"}
{"another_id":10,"user_id":1632511524,"some_id":"AdMybO","event_time":1424980760,"some_string":"fbx"}
{"another_id":6,"user_id":1201817175,"some_id":"GgQJEd","event_time":1424980760,"some_string":"fbx"}
I hope this is just a problem with my table definition. Any help is welcome.
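Answer 0 (score: 0)
I have not used the HCatalog SerDe. What I wanted, though, was to store JSON in HDFS and read it as a Hive table, and I eventually got that working with a different SerDe, found here: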
https://github.com/rcongiu/Hive-JSON-Serde
For me, it works perfectly fine on CDH4.
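A minimal sketch of what this could look like for the table in the question, not a tested setup. It assumes the SerDe jar has been built or downloaded to a local path (the path below is hypothetical) and that the SerDe class is org.openx.data.jsonserde.JsonSerDe, as given in that project's README. The table name some_table_openx is made up for illustration; the columns, partition and locations are copied from the question.

```sql
-- Hypothetical jar location: point this at wherever the Hive-JSON-Serde jar actually lives.
ADD JAR /path/to/json-serde-with-dependencies.jar;

-- Same columns and data location as in the question, but read through the other SerDe.
-- some_table_openx is an illustrative name, not something from the original post.
CREATE EXTERNAL TABLE some_table_openx (
  user_id int,
  event_time int,
  some_string string,
  some_id string,
  another_id int)
PARTITIONED BY (
  year int,
  month int,
  day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:8020/somedir/some_table';

-- Register the existing partition directory with the new table.
ALTER TABLE some_table_openx ADD IF NOT EXISTS PARTITION (year=2015, month=02, day=26)
LOCATION '/somedir/some_table/year=2015/month=02/day=26';

-- The kind of query that failed in the question: selecting a single column.
SELECT another_id FROM some_table_openx LIMIT 10;
```

ADD JAR makes the jar available to the session and to the jobs it launches. That may matter here because selecting a single column is run as a MapReduce job, while a plain select * with a limit can be answered by fetching the files directly.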