Error when implementing a simple sorting program in MapReduce with zero reduce tasks

Asked: 2012-03-24 06:34:57

Tags: sorting hadoop mapreduce

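I am trying to implement a sorting program in MapReduce so that I get sorted output right after the map phase, with the sorting done internally by the Hadoop framework. To do this, I set the number of reduce tasks to zero, since no reduction is needed. Now, whenever I run the program, I keep getting a checksum error, and I cannot figure out what to do next. The program itself does run on my netbook: when I set the number of reduce tasks to 1, the sort works fine. Please help!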


For reference, here is the entire code I wrote to perform the sort:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author root
 */
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import java.util.*;
import java.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;


public class word extends Configured implements Tool
{
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private static IntWritable one=new IntWritable(1);
        private Text word=new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter report) throws IOException
        {
            String line=value.toString();
            StringTokenizer token=new StringTokenizer(line," .,?!");
            String wordToken=null;

            while(token.hasMoreTokens())
            {
                wordToken=token.nextToken();
                output.collect(new Text(wordToken), one);

            }
        }

    }

    public int run(String args[])throws Exception
    {
        //Configuration conf=getConf();
        JobConf job=new JobConf(word.class);
        job.setInputFormat(TextInputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setMapperClass(Map.class);
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        JobClient.runJob(job);

        return 0;
    }

    public static void main(String args[])throws Exception
    {
        int exitCode=ToolRunner.run(new word(), args);
        System.exit(exitCode);

    }
}

Here is the checksum error I get when running this program:

12/03/25 10:26:42 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
12/03/25 10:26:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/25 10:26:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/25 10:26:44 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.JobClient: Running job: job_local_0001
12/03/25 10:26:45 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/25 10:26:45 INFO mapred.MapTask: numReduceTasks: 0
12/03/25 10:26:45 INFO fs.FSInputChecker: Found checksum error: b[0, 26]=610a630a620a640a650a740a790a780a730a670a7a0a680a730a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:45 WARN mapred.LocalJobRunner: job_local_0001
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/NetBeansProjects/projectAll/output/regionMulti/individual/part-00000 at 0
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
12/03/25 10:26:46 INFO mapred.JobClient:  map 0% reduce 0%
12/03/25 10:26:46 INFO mapred.JobClient: Job complete: job_local_0001
12/03/25 10:26:46 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at sortLog.run(sortLog.java:59)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at sortLog.main(sortLog.java:66)
Java Result: 1
BUILD SUCCESSFUL (total time: 4 seconds)

3 Answers:

Answer 0 (score: 2)

Take a look at org.apache.hadoop.mapred.MapTask in 0.20.2, around line 600.

  // get an output object
  if (job.getNumReduceTasks() == 0) {
     output =
       new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
  } else {
    output = new NewOutputCollector(taskContext, job, umbilical, reporter);
  }

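If you set the number of reduce tasks to zero, the map task writes directly to the output. Otherwise, NewOutputCollector uses the so-called MapOutputBuffer, which performs the spilling, sorting, combining and partitioning.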

So when you configure no reducers, no sorting takes place, even though Tom White says it does in the Definitive Guide.
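To make the branch above concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: CollectorSketch and collect are hypothetical names, no Hadoop classes are involved, and partitioning, spilling and combining are ignored. It shows only that the zero-reducer path passes records through in emission order, while the buffered path sorts them by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: mimics the numReduceTasks branch, it is not Hadoop code.
public class CollectorSketch {

    // Returns the keys in the order they would reach the next stage.
    public static List<String> collect(List<String> mapOutputKeys, int numReduceTasks) {
        List<String> out = new ArrayList<>(mapOutputKeys);
        if (numReduceTasks == 0) {
            // Direct path: records are written exactly as the mapper emits them.
            return out;
        }
        // Buffered path: records are sorted by key before they are handed on.
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pear", "apple", "fig");
        System.out.println(collect(keys, 0)); // prints [pear, apple, fig]
        System.out.println(collect(keys, 1)); // prints [apple, fig, pear]
    }
}
```

With zero reduce tasks the keys come out in the order the mapper produced them, which is why the map-only job cannot be used as a sort.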

Answer 1 (score: 1)

I ran into the same problem (a checksum error at 0 for the file part-00000). I solved it by renaming the file to any name other than -00000.
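This answer does not say why renaming helps. One plausible explanation, offered here as an assumption rather than something the thread confirms: Hadoop's checksummed local file system keeps the checksum for a file named NAME in a hidden sibling file named .NAME.crc. If the input file was changed after an earlier job wrote it, a leftover .crc file would no longer match, and renaming the data file detaches it from that sidecar. The sketch below only computes where such a sidecar would live and reports whether it exists; CrcSidecarCheck and sidecarFor are hypothetical names, and the default path is a placeholder.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CrcSidecarCheck {

    // For a data file "dir/name", the checksum sidecar is "dir/.name.crc".
    public static Path sidecarFor(Path dataFile) {
        String crcName = "." + dataFile.getFileName() + ".crc";
        return dataFile.resolveSibling(crcName);
    }

    public static void main(String[] args) {
        // Placeholder default; pass the real input file as the first argument.
        Path data = Paths.get(args.length > 0 ? args[0] : "input/part-00000");
        Path crc = sidecarFor(data);
        System.out.println(crc + (Files.exists(crc) ? " exists" : " not found"));
    }
}
```

If a sidecar turns up next to an input file that was edited by hand, that would be consistent with the error in the question, though it does not prove the cause.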

Answer 2 (score: 0)

So if you need at least one reducer for the internal sort to happen, you can use IdentityReducer.
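A minimal sketch of that suggestion, using the old org.apache.hadoop.mapred API from the question. SortWithIdentityReducer is a hypothetical class name, and the driver assumes it is compiled alongside the question's word class so that word.Map is available as the mapper. It needs the Hadoop jars on the classpath and has not been run against the asker's data.

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: same job as in the question, but with one identity reducer.
public class SortWithIdentityReducer extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Build the JobConf from getConf() so options parsed by ToolRunner are kept.
        JobConf job = new JobConf(getConf(), SortWithIdentityReducer.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Mapper from the question (assumed to be on the classpath).
        job.setMapperClass(word.Map.class);
        // The reducer passes every (key, value) pair through unchanged;
        // having a reduce phase is what makes the framework sort the keys.
        job.setReducerClass(IdentityReducer.class);
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // runJob throws an IOException if the job fails.
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new SortWithIdentityReducer(), args));
    }
}
```

A single reduce task yields one output file sorted by key. With more than one reduce task, each part file is sorted on its own, but the files together are not in a total order unless the partitioner is chosen for that purpose.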

You may also want to look at this discussion: hadoop: difference between 0 reducer and identity reducer?