Hadoop - what should be the key and value

Time: 2012-09-11 14:57:15

Tags: hadoop

I am new to Hadoop.

My goal is to upload a large number of files with different extensions onto the Hadoop cluster and get output like the following:

    Extension    No. of files
    .jpeg        1000
    .java        600
    .txt         3000

and so on.

I assume that the file name has to be the key to the mapper method, so that I can read the extension (and do other file operations in the future):

    public void map(Text fileName,
            null /* will this do? - a value isn't required in this case */,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        Text extension = new Text(FilenameUtils.getExtension(fileName.toString()));
        output.collect(extension, new IntWritable(1));
    }

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }

Queries:

  1. How do I send the name of the file as the key to the Mapper? I was thinking of implementing the RecordReader interface, but I am not sure whether it is required, and I also could not decide which implementation class to use!
  2. As per the API and my understanding, the InputFormat implementation is responsible for providing the splits for processing. Do I have to do something here to make this work?
  3. Please guide me in case I have made any fundamentally incorrect assumptions about Hadoop MapReduce concepts.

------------------- First Edit -------------------

Attaching the code, output, and queries:

    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    
    
    public class Main {
    
        /**
         * @param args
         * @throws IOException
         */
        public static void main(String[] args) throws IOException {
            // TODO Auto-generated method stub
    
            Main main = new Main();
    
            if (args == null || args.length == 0) {
                throw new RuntimeException("Enter path to read files");
            }
    
            main.groupFilesByExtn(args);
        }
    
        private void groupFilesByExtn(String[] args) throws IOException {
            // TODO Auto-generated method stub
    
            JobConf conf = new JobConf(Main.class);
            conf.setJobName("Grp_Files_By_Extn");
    
            /* InputFormat and OutputFormat from 'mapred' package ! */
            conf.setInputFormat(CustomFileInputFormat.class);
            conf.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);
    
            /* No restrictions here ! */
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
    
            /* Mapper and Reducer classes from 'mapred' package ! */
            conf.setMapperClass(CustomMapperClass.class);
            conf.setReducerClass(CustomReducer.class);
            conf.setCombinerClass(CustomReducer.class);
    
            CustomFileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    
            JobClient.runJob(conf);
        }
    
    }
    

Custom FileInputFormat

    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    
    public class CustomFileInputFormat extends
            FileInputFormat<String, NullWritable> {
    
        @Override
        public RecordReader<String, NullWritable> getRecordReader(InputSplit aFile,
                JobConf arg1, Reporter arg2) throws IOException {
            // TODO Auto-generated method stub
    
            System.out.println("In CustomFileInputFormat.getRecordReader(...)");
            /* the cast - ouch ! */
            CustomRecordReader custRecRdr = new CustomRecordReader(
                    (FileSplit) aFile);
    
            return custRecRdr;
        }
    
    }
    

Custom RecordReader

    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.RecordReader;
    
    public class CustomRecordReader implements RecordReader<String, NullWritable> {
    
        private FileSplit aFile;
        private String fileName;
    
        public CustomRecordReader(FileSplit aFile) {
    
            this.aFile = aFile;
    
            System.out.println("In CustomRecordReader constructor aFile is "
                    + aFile.getClass().getName());
        }
    
        @Override
        public void close() throws IOException {
            // TODO Auto-generated method stub
    
        }
    
        @Override
        public String createKey() {
            // TODO Auto-generated method stub
            fileName = aFile.getPath().getName();
    
            System.out.println("In CustomRecordReader.createKey() "+fileName);
    
            return fileName;
        }
    
        @Override
        public NullWritable createValue() {
            // TODO Auto-generated method stub
            return null;
        }
    
        @Override
        public long getPos() throws IOException {
            // TODO Auto-generated method stub
            return 0;
        }
    
        @Override
        public float getProgress() throws IOException {
            // TODO Auto-generated method stub
            return 0;
        }
    
        @Override
        public boolean next(String arg0, NullWritable arg1) throws IOException {
            // TODO Auto-generated method stub
            return false;
        }
    
    }
    

Mapper

    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    
    import org.apache.commons.io.FilenameUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    
    public class CustomMapperClass extends MapReduceBase implements
            Mapper<String, NullWritable, Text, IntWritable> {
    
        private static final int COUNT = 1;
    
        @Override
        public void map(String fileName, NullWritable value,
                OutputCollector<Text, IntWritable> outputCollector,
                Reporter reporter) throws IOException {
            // TODO Auto-generated method stub
            System.out.println("In CustomMapperClass.map(...) : key " + fileName
                    + " value = " + value);
    
            outputCollector.collect(new Text(FilenameUtils.getExtension(fileName)),
                    new IntWritable(COUNT));
    
            System.out.println("Returning from CustomMapperClass.map(...)");
        }
    
    }
    

Reducer:

    /**
     * 
     */
    package com.hadoop.mapred.scratchpad;
    
    import java.io.IOException;
    import java.util.Iterator;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    
    
    public class CustomReducer extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
    
        @Override
        public void reduce(Text fileExtn, Iterator<IntWritable> countCollection,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // TODO Auto-generated method stub
    
            System.out.println("In CustomReducer.reduce(...)");
            int count = 0;
    
            while (countCollection.hasNext()) {
                count += countCollection.next().get();
            }
    
            output.collect(fileExtn, new IntWritable(count));
    
            System.out.println("Returning CustomReducer.reduce(...)");
        }
    
    }
    

Output (HDFS) directory:

    hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output
    Warning: $HADOOP_HOME is deprecated.
    
    Found 3 items
    -rw-r--r--   4 hd supergroup          0 2012-10-11 20:52 /scratchpad/output/_SUCCESS
    drwxr-xr-x   - hd supergroup          0 2012-10-11 20:51 /scratchpad/output/_logs
    -rw-r--r--   4 hd supergroup          0 2012-10-11 20:52 /scratchpad/output/part-00000
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output/_logs
    Warning: $HADOOP_HOME is deprecated.
    
    Found 1 items
    drwxr-xr-x   - hd supergroup          0 2012-10-11 20:51 /scratchpad/output/_logs/history
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    hd@cloudx-538-520:~/hadoop/logs/userlogs$
    

Logs (opened just one of them):

    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ ls -lrt
    total 16
    -rw-r----- 1 hd hd 393 2012-10-11 20:52 job-acls.xml
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000000_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000000_0
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000002_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000002_0
    lrwxrwxrwx 1 hd hd  95 2012-10-11 20:52 attempt_201210091538_0019_m_000001_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000001_0
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ cat attempt_201210091538_0019_m_000000_0/stdout
    In CustomFileInputFormat.getRecordReader(...)
    In CustomRecordReader constructor aFile is org.apache.hadoop.mapred.FileSplit
    In CustomRecordReader.createKey() ExtJS_Notes.docx
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
    

As can be seen above:

  1. The output on HDFS is a 0 KB file.
  2. The logs show the sysouts only up to the point where the thread is inside CustomRecordReader.
  3. What am I missing?

1 Answer:

Answer 0: (score: 1)

Kaliyug,

For your requirement, there is no need to pass the file name to the mapper: it is already available inside the mapper. Just access it as shown below. The rest is simple; just mimic the plain word-count program.

    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String fileName = fileSplit.getPath().getName();

In the case of the new API, the Reporter needs to be changed to the Context.
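
For reference, a minimal sketch of the same lookup in the new (org.apache.hadoop.mapreduce) API is shown below; the class name and the input types are assumptions for illustration, not taken from the question:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hypothetical class; assumes a line-oriented input such as TextInputFormat.
    public class NewApiFileNameMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // In the new API the split is reached through the Context,
            // not through a Reporter.
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            // ... use fileName as needed ...
        }
    }

Note that with a line-oriented input format, map() runs once per record rather than once per file, so emitting a count from here would count records, not files; that is what the record-reader approach described next avoids.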

As a performance optimization, you can just create a record reader that simply supplies the file name as the key to the mapper (the same approach as above). Make the record reader not read any file contents, and make the value part a NullWritable.
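
A minimal sketch of such a record reader, in the old mapred API used in the question (the class name is hypothetical), is shown below. The essential point is that next(key, value) must populate the key and return true exactly once per split before returning false; the CustomRecordReader above returns false immediately, so the mapper is never invoked and the output stays empty:

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.RecordReader;

    // Hypothetical sketch: supplies the file name as the only record of the split.
    public class FileNameRecordReader implements RecordReader<Text, NullWritable> {

        private final FileSplit split;
        private boolean processed = false;

        public FileNameRecordReader(FileSplit split) {
            this.split = split;
        }

        @Override
        public Text createKey() {
            return new Text(); // reusable holder; filled in next()
        }

        @Override
        public NullWritable createValue() {
            return NullWritable.get(); // no value is needed
        }

        @Override
        public boolean next(Text key, NullWritable value) throws IOException {
            if (processed) {
                return false; // exactly one record per split
            }
            key.set(split.getPath().getName()); // the file name is the record
            processed = true;
            return true;
        }

        @Override
        public long getPos() throws IOException {
            return processed ? 1 : 0;
        }

        @Override
        public float getProgress() throws IOException {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() throws IOException {
            // nothing to close; the file contents are never opened
        }
    }

If a file can span multiple HDFS blocks, overriding isSplitable(...) in the FileInputFormat to return false keeps one split, and hence one emitted file name, per file.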

The Mapper will get the file name as the key. Just emit <file_extension, 1> as the <key, value> pair to the reducer.
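
A mapper matching that record reader could look like the sketch below (again, the class name is hypothetical); it differs from the CustomMapperClass above only in taking Text rather than String as the input key type:

    import java.io.IOException;

    import org.apache.commons.io.FilenameUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical sketch: one <file_extension, 1> pair per file.
    public class ExtensionCountMapper extends MapReduceBase
            implements Mapper<Text, NullWritable, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text extension = new Text();

        @Override
        public void map(Text fileName, NullWritable value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit <file_extension, 1> for the single record of this split.
            extension.set(FilenameUtils.getExtension(fileName.toString()));
            output.collect(extension, ONE);
        }
    }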

The Reducer needs to carry out the same logic as word count.