如何通过Hadoop mapreduce WordCount对最频繁重复的单词列表进行排序?

时间:2012-07-24 11:23:39

标签: java hadoop mapreduce

嗨,我是hadoop mapreduce的新手。

你们中的任何人都可以帮我修改下面发布的代码以显示所需的输出吗?

我有一个给定的输入文件

输入Hi my name is John.Im doing my engineering.My parents stay at California

我得到输出为

Hi    1
my   3
name  1
is    1
is 1
John 1
doing  1
engineering 1
parents  1
stay  1
at  1
California   1

但我希望将输出排序

 my   3
 Hi   1 
 etc.....

然后显示所有其他人。这个概念是显示重复最多次的单词应该先排序和显示。

我在单个节点上运行此作业。我正在做这份工作

        $ hadoop jar job.jar input output

我已经开始了

        $ hadoop namenode -format
        $ hadoop namenode

        $ hadoop datanode
        sbin$ ./yarn-daemon.sh start resourcemanager 
        sbin$ ./yarn-daemon.sh start resourcemanager

我正在运行hadoop-2.0.0-cdh4.0.0

        package org.apache.hadoop.examples;

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.commons.logging.Log;
        import org.apache.commons.logging.LogFactory;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.IntWritable;
        import org.rg.apache.hadoop.fs.Path;
        import oapache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.GenericOptionsParser;

        public class WordCount {
        private static final Log LOG = LogFactory.getLog(WordCount.class);

          public static class TokenizerMapper
               extends Mapper<Object, Text, Text, IntWritable>{

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class IntSumReducer
               extends Reducer<Text,IntWritable,Text,IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context
                               ) throws IOException, InterruptedException {
              int sum = 0;
              //printKeyAndValues(key, values);

              for (IntWritable val : values) {
                sum += val.get();
              LOG.info("val = " + val.get());
              }
              LOG.info("sum = " + sum + " key = " + key);
              result.set(sum);
              context.write(key, result);
              //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
            }


          // a little method to print debug output
            private void printKeyAndValues(Text key, Iterable<IntWritable> values)
            {
              StringBuilder sb = new StringBuilder();
              for (IntWritable val : values)
              {
                sb.append(val.get() + ", ");
              }
              System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
              System.err.println("Usage: wordcount <in> <out>");
              System.exit(2);
            }
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
        }

如果有人能排除这种想法,我会很棒。

1 个答案:

答案 0 :(得分:2)

每次找到单词时减少计数怎么样?从0开始,你将得到数字。最重要的数字应该是第一个。