I am trying to use ORC as the input format for Hadoop Streaming.
This is how I run it:
export HADOOP_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
But I get this error:
Error: java.io.IOException: Split class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:363)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.ql.io.orc.OrcSplit not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2018)
    at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:361)
    ... 7 more
It looks like the OrcSplit class should be in hive-exec.jar.
Answer 0 (score: 1)
A simpler solution is to let hadoop-streaming distribute the library jar for you with the -libjars
parameter. It takes a comma-separated list of jars. For example, you can do:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-libjars /opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar \
-file /home/mr/mapper.py -mapper /home/mr/mapper.py \
-file /home/mr/reducer.py -reducer /home/mr/reducer.py \
-input /user/cloudera/input/users/orc \
-output /user/cloudera/output/simple \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
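For reference, with an input format like this, a streaming mapper receives each record as a line of text on stdin (the value's text rendering, so the exact field layout depends on your table schema). The script below is only a hedged sketch of what /home/mr/mapper.py might contain, with placeholder field logic, not the actual file:

```python
#!/usr/bin/env python
# Hedged sketch of a streaming mapper. Each stdin line is assumed to be one
# ORC row rendered as text; the real field layout depends on your schema.
import sys

def map_rows(lines):
    # Placeholder logic: emit "<first_field>\t1" for each non-empty row.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield "%s\t1" % fields[0]

if __name__ == "__main__":
    for record in map_rows(sys.stdin):
        print(record)
```

The reducer would then aggregate the emitted key/value pairs in the usual streaming fashion.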
Answer 1 (score: 0)
I found the answer. My problem was that I had set the HADOOP_CLASSPATH variable on only one node, so I should either set it on every node or use the distributed cache instead.
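Either way, it helps to confirm that the class is actually packaged in the jar you distribute. Since a jar is just a zip archive, a small Python sketch can list its entries (the CDH jar path used in this thread is an assumption about your install):

```python
# Hedged sketch: check whether a class is packaged inside a jar by listing
# the jar's entries (a jar file is a zip archive).
import zipfile

def jar_contains_class(jar_path, class_name):
    # class_name is the dotted name, e.g.
    # "org.apache.hadoop.hive.ql.io.orc.OrcSplit"
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, jar_contains_class("/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar", "org.apache.hadoop.hive.ql.io.orc.OrcSplit") should return True if the jar you pass to -libjars really contains the missing split class.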