Here is the scenario: there are three folders in HDFS, containing the following files:
/root/20140901/part-0
/root/20140901/part-1
/root/20140901/part-2
/root/20140902/part-0
/root/20140902/part-1
/root/20140902/part-2
/root/20140903/part-0
/root/20140903/part-1
/root/20140903/part-2
I created a Hive table with the statements below. When I run the HQL `select * from hive_combine_test where rdm > 50000;`, it takes 9 mappers, the same as the number of files in HDFS (the default HiveInputFormat never combines files, so each small file becomes its own split).
CREATE EXTERNAL TABLE hive_combine_test (
  id string,
  rdm string)
PARTITIONED BY (dateid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140901')
LOCATION '/root/20140901';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140902')
LOCATION '/root/20140902';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140903')
LOCATION '/root/20140903';
But what I want is to combine the part-i files into larger splits, so that the query needs only three mappers.
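A configuration-only route might get there without any custom code. This is just a sketch, assuming a Hive version that ships the built-in org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (the split size value is only an illustration):
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
hive> set mapred.max.split.size=256000000;
hive> select * from hive_combine_test where rdm > 50000;
CombineHiveInputFormat packs small files from the same partition directory into one split, so each date folder could collapse into a single mapper.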
Still, to see whether a custom input format would take effect, I tried subclassing org.apache.hadoop.hive.ql.io.HiveInputFormat with a custom JudHiveInputFormat:
package com.judking.hive.inputformat;

import org.apache.hadoop.hive.ql.io.HiveInputFormat;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class JudHiveInputFormat<K extends WritableComparable, V extends Writable>
        extends HiveInputFormat<WritableComparable, Writable> {
}
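For comparison, a variant that extends Hive's combining input format instead would inherit the split-merging behavior out of the box; this is only a hypothetical sketch along the same lines:
package com.judking.hive.inputformat;

import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch: CombineHiveInputFormat merges small files within a
// partition directory into combined splits, unlike plain HiveInputFormat.
public class JudCombineHiveInputFormat
        extends CombineHiveInputFormat<WritableComparable, Writable> {
}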
But when I register the custom input format in Hive, it throws an exception:
hive> add jar /my_path/jud_udf.jar;
hive> set hive.input.format=com.judking.hive.inputformat.JudHiveInputFormat;
hive> select * from hive_combine_test where rdm > 50000;
java.lang.RuntimeException: com.judking.hive.inputformat.JudCombineHiveInputFormat
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:290)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1472)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1239)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1057)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:880)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:870)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Can anyone give me a clue? Thanks a lot!
Answer 0 (score: 0)
As far as I know, to plug a custom INPUT/OUTPUT format into Hive, you need to specify it in the CREATE TABLE statement, something like this:
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT '<your input format class name>' OUTPUTFORMAT '<your output format class name>';
Since you only need an InputFormat, your CREATE TABLE statement would look like this (note that Hive expects the fully qualified class name here):
CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'com.judking.hive.inputformat.JudHiveInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
The reason you have to mention the OUTPUT format class as well: once you override the INPUT format, Hive still needs an OUTPUT class, so here we explicitly tell Hive to keep using its default output format class.
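Applied to the table from the question, the full statement might look like this (a sketch; it reuses the output format class from above and assumes the custom class is packaged as com.judking.hive.inputformat.JudHiveInputFormat):
CREATE EXTERNAL TABLE hive_combine_test (
  id string,
  rdm string)
PARTITIONED BY (dateid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'com.judking.hive.inputformat.JudHiveInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat';
The partitions would then be added with the same ALTER TABLE ... ADD PARTITION statements as before.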
Maybe you can give this a try. Hope it helps!