Hadoop MapReduce in Java: Counting Occurrences by Date

Posted: 2017-06-02 10:28:34

Tags: java hadoop mapreduce

I am new to Hadoop and I am trying to solve a problem with MapReduce in Java. I have a file where each line contains a date and some words (A, B, C, ...):

  • 2016-05-10,A,B,C,A,R,E,F,E
  • 2016-05-18,A,B,F,E,E
  • 2016-06-01,A,B,K,T,T,E,G,E,A,N
  • 2016-06-03,A,B,K,T,T,E,F,E,L,T

I implemented a MapReduce algorithm so that, for each month, I can find the total number of occurrences of each word and then report the two words with the highest counts. The result I have in mind looks like this (see the plain-Java sketch right after these examples):

  • 2016-05,A:3,E:4
  • 2016-06,T:5,E:4
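
To make the target computation concrete, here is a plain-Java sketch of the same per-month counting and top-2 selection, outside Hadoop; the class and variable names are only illustrative:

import java.util.*;

public class MonthlyTopTwo {
    public static void main(String[] args) {
        // Sample input, one record per line: a date followed by letters.
        List<String> lines = Arrays.asList(
                "2016-05-10,A,B,C,A,R,E,F,E",
                "2016-05-18,A,B,F,E,E",
                "2016-06-01,A,B,K,T,T,E,G,E,A,N",
                "2016-06-03,A,B,K,T,T,E,F,E,L,T");

        // month -> (letter -> count)
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String month = parts[0].substring(0, parts[0].lastIndexOf("-"));
            Map<String, Integer> letters = counts.computeIfAbsent(month, m -> new HashMap<>());
            for (int i = 1; i < parts.length; i++) {
                letters.merge(parts[i], 1, Integer::sum);
            }
        }

        // For each month, print the two letters with the highest counts.
        for (Map.Entry<String, Map<String, Integer>> month : counts.entrySet()) {
            month.getValue().entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(2)
                    .forEach(e -> System.out.println(month.getKey() + "," + e.getKey() + ":" + e.getValue()));
        }
    }
}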

I tried two different approaches to this. The first one:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GiulioTest {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, ",");

            // The first token is the full date (e.g. "2016-05-10"); keep only the
            // year-month part of that token, not of the whole line.
            String date = tokenizer.nextToken();
            String dataAttuale = date.substring(0, date.lastIndexOf("-"));

            // Emit one ("year-month:letter", 1) pair per letter on the line.
            while (tokenizer.hasMoreTokens()) {
                String prod = tokenizer.nextToken();
                word.set(dataAttuale + ":" + prod);
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // The new-API reducer receives an Iterable, not an Iterator; with the wrong
        // signature this method would not override reduce() and the job would fall
        // back to the identity reducer.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(GiulioTest.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

I expect this code to give me results like this:

  • 2016-05:A 3
  • 2016-05:E 4
  • 2016-05:...
  • 2016-05:other letters
  • 2016-06:T 5
  • 2016-06:...

Then I need to find a way to get, for each month, the two letters with the highest counts. In fact, I do not know whether at this point there is a way to re-process the keys in order to extract the maximum values. Does anyone have a suggestion?
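
The direction I am considering, only as a rough sketch (the class names below are placeholders, the imports are the same as in the listing above, and it assumes the first job writes tab-separated lines such as "2016-05:A	3"), is a second job that re-keys each record by its month and keeps the two largest counts in the reducer:

// Hypothetical follow-up job: it reads the output of the job above,
// where each line looks like "2016-05:A<TAB>3".
public static class TopTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "2016-05:A\t3" -> key "2016-05", value "A:3"
        String[] keyAndCount = value.toString().split("\t");
        String[] monthAndLetter = keyAndCount[0].split(":");
        context.write(new Text(monthAndLetter[0]),
                      new Text(monthAndLetter[1] + ":" + keyAndCount[1]));
    }
}

public static class TopTwoReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Keep only the two letters with the largest counts for this month
        // (assumes at least two distinct letters per month).
        String first = null, second = null;
        int firstCount = -1, secondCount = -1;
        for (Text val : values) {
            String[] letterAndCount = val.toString().split(":");
            int count = Integer.parseInt(letterAndCount[1]);
            if (count > firstCount) {
                second = first;
                secondCount = firstCount;
                first = letterAndCount[0];
                firstCount = count;
            } else if (count > secondCount) {
                second = letterAndCount[0];
                secondCount = count;
            }
        }
        context.write(key, new Text(first + ":" + firstCount + "," + second + ":" + secondCount));
    }
}

The driver for this second job would use Text.class for both output key and value classes and take the first job's output directory as its input path.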

The other solution I came up with, but only in pseudocode, and I do not know whether it can be done with the MapReduce framework: define a Text key, define a List listValues, define a finalMap // a Map whose values are themselves a map from String to Integer

mapper(key, value, context) {
   month = // retrieve it with a StringTokenizer splitting on ','
   tmpKey = month;
   while (itr.hasMoreTokens()) {
        listValues.add(itr.nextToken())
   }
   key.set(tmpKey)
   context.write(key, listValues) // And here is my first doubt: is it possible to write something like context.write(Text, List<String>)?
}

reduce(Text key, Iterable<List<String>> values, Context context) {
   Map<String, Int> letterVal = new ...;
   for (List<String> listLetter : values) {
      while (listLetter.hasNext()) {
         String letter = listLetter.next()
         if (letterVal.containsKey(letter)) {
             Int tmpVal = letterVal.get(letter)
             letterVal.put(letter, tmpVal + 1);
         } else
             letterVal.put(letter, 1)
      }
   }
   finalMap.put(key, letterVal)
   context.write(key, finalMap.get(key).toString())
}
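
For completeness, a compilable variant of this second idea, again only a sketch reusing the imports of the first listing: List<String> is not a Hadoop Writable, so the mapper below passes the letters of a line as a single comma-separated Text value and the reducer rebuilds the per-month counts:

public static class MonthMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "2016-05-10,A,B,C,..." -> key "2016-05", value "A,B,C,..."
        String line = value.toString();
        int firstComma = line.indexOf(",");
        String date = line.substring(0, firstComma);              // "2016-05-10"
        String month = date.substring(0, date.lastIndexOf("-"));  // "2016-05"
        context.write(new Text(month), new Text(line.substring(firstComma + 1)));
    }
}

public static class MonthReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // java.util.Map is written out in full so it does not clash with the
        // Mapper subclass named Map in the first listing.
        java.util.Map<String, Integer> letterVal = new java.util.HashMap<String, Integer>();
        for (Text letters : values) {
            for (String letter : letters.toString().split(",")) {
                Integer current = letterVal.get(letter);
                letterVal.put(letter, current == null ? 1 : current + 1);
            }
        }
        // Emits e.g. "2016-06    {A=3, B=2, T=5, ...}"; selecting the top two
        // could be done here by sorting the entries by value.
        context.write(key, new Text(letterVal.toString()));
    }
}

A driver for this variant would set Text.class for both the map output and final output value classes, and it must not reuse the counting reducer as a combiner, since the value formats differ.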

0 Answers:

No answers yet.