Calling a Hadoop job from Java code without a jar

Date: 2012-04-02 14:31:48

Tags: java hadoop jobs

I am using the code below to run a word count Hadoop job. WordCountDriver runs fine when I launch it from inside Eclipse using the Hadoop Eclipse plugin. WordCountDriver also runs from the command line when I package the mapper and reducer classes into a jar and put it on the classpath.

However, it fails when I try to run it from the command line without the mapper and reducer classes packaged as a jar, even though I added both classes to the classpath. I'm wondering whether Hadoop has some restriction against accepting mapper and reducer classes as plain class files. Is creating a jar always mandatory?

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

public static final String HADOOP_ROOT_DIR = "hdfs://universe:54310/app/hadoop/tmp";


static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;

        for (IntWritable value : values) {
            sum += value.get(); // process value
        }       
        context.write(key, new IntWritable(sum));
    }
}


/**
 * Configures and submits the word count job.
 */
public int run(String[] args) throws Exception {

    Configuration conf = getConf();

    conf.set("mapred.job.tracker", "universe:54311");

    Job job = new Job(conf, "Word Count");

    // specify output types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // specify input and output dirs
    FileInputFormat.addInputPath(job, new Path(HADOOP_ROOT_DIR + "/input"));
    FileOutputFormat.setOutputPath(job, new Path(HADOOP_ROOT_DIR + "/output"));

    // specify a mapper
    job.setMapperClass(WordCountDriver.WordCountMapper.class);

    // specify a reducer
    job.setReducerClass(WordCountDriver.WordCountReducer.class);
    job.setCombinerClass(WordCountDriver.WordCountReducer.class);

    job.setJarByClass(WordCountDriver.WordCountMapper.class);

    return job.waitForCompletion(true) ? 0 : 1;
}

/**
 * Runs the driver through ToolRunner so that generic Hadoop
 * options (such as -libjars) are parsed.
 *
 * @param args command-line arguments
 * @throws Exception if the job cannot be submitted
 */
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new WordCountDriver(), args);
    System.exit(res);
}


2 Answers:

Answer 0 (score: 1):

It's not entirely clear which classpath you are referring to, but in the end, if you are running against a remote Hadoop cluster, you need to provide all of the classes in a JAR file that is sent to Hadoop as part of the hadoop jar invocation. The classpath of your local program makes no difference.

It probably works locally because there you are actually running a Hadoop instance inside your local process, so in that case it happens to be able to find the classes on your local program's classpath.
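Concretely, that means packaging the compiled classes into a jar and submitting it with hadoop jar. A minimal sketch, assuming the .class files were compiled into a hypothetical build/classes directory:

jar cf wordcount.jar -C build/classes .
hadoop jar wordcount.jar WordCountDriver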

Answer 1 (score: 1):

Adding the classes to the Hadoop classpath makes them available client-side (i.e., to your driver).

Your mapper and reducer classes need to be available cluster-wide. The easiest way to achieve this with Hadoop is to bundle them into a jar and either point the job at it with Job.setJarByClass(..), or add them to the job's classpath using the -libjars option together with GenericOptionsParser:
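A typical invocation would look like the following sketch (the jar names are hypothetical; because WordCountDriver is launched through ToolRunner, GenericOptionsParser picks up -libjars automatically and ships those jars to the cluster alongside the job):

hadoop jar wordcount.jar WordCountDriver -libjars mapper-classes.jar,reducer-classes.jar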