Question

我的MapReduce应用程序计算Hive表中字段值的使用情况。在包含/usr/lib/hadood，/usr/lib/hive和/usr/lib/hcatalog目录中的所有jar之后，我设法从Eclipse构建并运行它。它有效。

经过多次挫折之后，我还成功编译并以Maven项目的形式运行它：

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bigdata.hadoop</groupId>
<artifactId>FieldCounts</artifactId>
  <packaging>jar</packaging>
  <name>FieldCounts</name>
  <version>0.0.1-SNAPSHOT</version>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
<dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
<dependency>   
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.3.0</version>
</dependency>    
<dependency>
    <groupId>org.apache.hcatalog</groupId>
    <artifactId>hcatalog-core</artifactId>
    <version>0.11.0</version>
</dependency>
</dependencies>
</project>

要从命令行运行作业，必须使用以下脚本：

#!/bin/sh
export LIBJARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/jdo-api-3.0.1.jar,/usr/lib/hive/lib/antlr-runtime-3.4.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar,/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/libfb303-0.9.0.jar:/usr/lib/hive/lib/jdo-api-3.0.1.jar:/usr/lib/hive/lib/antlr-runtime-3.4.jar:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar:/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
hadoop jar FieldCounts-0.0.1-SNAPSHOT.jar com.bigdata.hadoop.FieldCounts -libjars ${LIBJARS} simple simpout

现在Hadoop创建并启动下一个失败的作业，因为 Hadoop找不到Map类：

14/03/26 16:25:58 INFO mapreduce.Job: Running job: job_1395407010870_0007
14/03/26 16:26:07 INFO mapreduce.Job: Job job_1395407010870_0007 running in uber mode : false
14/03/26 16:26:07 INFO mapreduce.Job:  map 0% reduce 0%
14/03/26 16:26:13 INFO mapreduce.Job: Task Id : attempt_1395407010870_0007_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more

为什么会这样？作业jar包含所有类，包括Map：

 jar tvf FieldCounts-0.0.1-SNAPSHOT.jar 
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
   121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
     0 Wed Mar 26 14:29:58 MSK 2014 com/
     0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
     0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
  3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
  4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
  2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
  1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
  4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
  1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
   123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
[hdfs@localhost target]$ jar tvf FieldCounts-0.0.1-SNAPSHOT.jar 
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
   121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
     0 Wed Mar 26 14:29:58 MSK 2014 com/
     0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
     0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
  3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
  4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
  2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
  1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
  4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
     0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
  1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
   123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties

有什么问题？我应该将Map和Reduce类放在单独的文件中吗？

MapReduce代码：

package com.bigdata.hadoop;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;

public class FieldCounts extends Configured implements Tool {

    public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {

        static Logger logger = Logger.getLogger("com.foo.Bar");

        static boolean firstMapRun = true;
        static List<String> fieldNameList = new LinkedList<String>();
        /**
         * Return a list of field names not containing `id` field name
         * @param schema
         * @return
         */
        static List<String> getFieldNames(HCatSchema schema) {
            // Filter out `id` name just once
            if (firstMapRun) {
                firstMapRun = false;
                List<String> fieldNames = schema.getFieldNames();
                for (String fieldName : fieldNames) {
                    if (!fieldName.equals("id")) {
                        fieldNameList.add(fieldName);
                    }
                }
            } // if (firstMapRun)
            return fieldNameList;
        }

        @Override
      protected void map( WritableComparable key,
                          HCatRecord hcatRecord,
                          //org.apache.hadoop.mapreduce.Mapper
                          //<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
                          Context context)
            throws IOException, InterruptedException {

            HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());

           //String schemaTypeStr = schema.getSchemaAsTypeString();
           //logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);

           //List<String> fieldNames = schema.getFieldNames();
            List<String> fieldNames = getFieldNames(schema);
            for (String fieldName : fieldNames) {
                Object value = hcatRecord.get(fieldName, schema);
                String fieldValue = null;
                if (null == value) {
                    fieldValue = "<NULL>";
                } else {
                    fieldValue = value.toString();
                }
                //String fieldNameValue = fieldName+"."+fieldValue;
                //context.write(new Text(fieldNameValue), new IntWritable(1));
                TableFieldValueKey fieldKey = new TableFieldValueKey();
                fieldKey.fieldName = fieldName;
                fieldKey.fieldValue = fieldValue;
                context.write(fieldKey, new IntWritable(1));
            }

        }       
    }

    public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
                                       WritableComparable, HCatRecord> {

        protected void reduce( TableFieldValueKey key,
                               java.lang.Iterable<IntWritable> values,
                               Context context)
                               //org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
                               //WritableComparable, HCatRecord>.Context context)
            throws IOException, InterruptedException {
            Iterator<IntWritable> iter = values.iterator();
            int sum = 0;
            // Sum up occurrences of the given key 
            while (iter.hasNext()) {
                IntWritable iw = iter.next();
                sum = sum + iw.get();
            }

            HCatRecord record = new DefaultHCatRecord(3);
            record.set(0, key.fieldName);
            record.set(1, key.fieldValue);
            record.set(2, sum);

            context.write(null, record);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
        conf.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());

        // Get the input and output table names as arguments
        String inputTableName = args[0];
        String outputTableName = args[1];
        // Assume the default database
        String dbName = null;

        Job job = new Job(conf, "FieldCounts");

        HCatInputFormat.setInput(job,
                InputJobInfo.create(dbName, inputTableName, null));
        job.setJarByClass(FieldCounts.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // An HCatalog record as input
        job.setInputFormatClass(HCatInputFormat.class);

        // Mapper emits TableFieldValueKey as key and an integer as value
        job.setMapOutputKeyClass(TableFieldValueKey.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Ignore the key for the reducer output; emitting an HCatalog record as
        // value
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setOutputFormatClass(HCatOutputFormat.class);

        HCatOutputFormat.setOutput(job,
                OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        System.err.println("INFO: output schema explicitly set for writing:"
                + s);
        HCatOutputFormat.setSchema(job, s);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        String classpath = System.getProperty("java.class.path");
        System.out.println("*** CLASSPATH: "+classpath);         
        int exitCode = ToolRunner.run(new FieldCounts(), args);
        System.exit(exitCode);
    }
}

Answer 1

我发现问题出在MapReduce jar所在的目录权限中。这个jar是在普通的，而不是hdfs用户的主目录中构建的。只要此MRD作业将其工作结果直接输出到Hive表中，就应该在hdfs用户下运行。如果这样的作业在普通用户下运行，它就没有权限将数据写入Hive表！

另一方面，CentOS中常规用户的主目录有700个权限。因此，当您在拥有此主目录的用户不同的用户下运行hadoop jar ...命令时，在Hadoop加载类的过程中，某些地方会拒绝访问MRD jar。这就是为什么在hdfs用户下此作业会产生java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.MyMap not found。

递归更改主目录的权限从700到755，其中构建了MRD jar解决了这个问题。

然而，还有一个更重要的问题：如何在普通用户下运行作业，以便它有权将数据写入Hive表？

Hadoop MapReduce作业启动但找不到Map类？

1 个答案: