Question

I'm new to Hadoop and I'm using MapReduce Java code to solve a problem. I have a file like this, where each line contains a date followed by some words (A, B, C, ...):

2016-05-10,A,B,C,A,R,E,F,E
2016-05-18,A,B,F,E,E
2016-06-01,A,B,K,T,T,E,G,E,A,N
2016-06-03,A,B,K,T,T,E,F,E,L,T

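I implemented a MapReduce algorithm that finds, for each month, the total number of occurrences of each word. From those totals I want to report the two words with the highest occurrence counts per month. I'm aiming for a result like this: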

2016-05, A:3, E:4
2016-06, T:5, E:4
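
To pin down the expected result before involving Hadoop, here is a minimal plain-Java sketch that computes the same per-month counts locally. The class and method names are made up for illustration. It lists the higher count first and breaks ties alphabetically, which is an assumption the question does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MonthlyTopTwoReference {

    /** Counts every word per month (yyyy-MM) from lines like "2016-05-10,A,B,C". */
    static Map<String, Map<String, Integer>> countByMonth(List<String> lines) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String date = fields[0].trim();
            String month = date.substring(0, date.lastIndexOf('-'));
            Map<String, Integer> monthCounts = counts.computeIfAbsent(month, m -> new TreeMap<>());
            for (int i = 1; i < fields.length; i++) {
                monthCounts.merge(fields[i].trim(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Formats the two most frequent words of one month as "word:count, word:count". */
    static String topTwo(Map<String, Integer> monthCounts) {
        // Highest count first; ties are broken alphabetically (an assumption).
        Comparator<Map.Entry<String, Integer>> byCountDescThenWord = (a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        };
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(monthCounts.entrySet());
        entries.sort(byCountDescThenWord);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < Math.min(2, entries.size()); i++) {
            if (i > 0) out.append(", ");
            out.append(entries.get(i).getKey()).append(':').append(entries.get(i).getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2016-05-10,A,B,C,A,R,E,F,E",
                "2016-05-18,A,B,F,E,E",
                "2016-06-01,A,B,K,T,T,E,G,E,A,N",
                "2016-06-03,A,B,K,T,T,E,F,E,L,T");
        for (Map.Entry<String, Map<String, Integer>> e : countByMonth(lines).entrySet()) {
            System.out.println(e.getKey() + ", " + topTwo(e.getValue()));
        }
    }
}
```

For the four sample lines this yields E:4 and A:3 for 2016-05, and T:5 and E:4 for 2016-06, which matches the counts above apart from the ordering within a month.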

I tried two different approaches. The first one:

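import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class GiulioTest {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, ",");
            // Keep only the yyyy-MM part of the leading date token.
            String date = tokenizer.nextToken();
            String dataAttuale = date.substring(0, date.lastIndexOf("-"));

            while (tokenizer.hasMoreTokens()) {
                String prod = tokenizer.nextToken().trim();
                // Composite key "yyyy-MM:word", emitted once per occurrence.
                word.set(dataAttuale + ":" + prod);
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // The values parameter must be an Iterable; with Iterator this method
        // does not override Reducer.reduce() and is never called.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(GiulioTest.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}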

I expected this code to give me a result like this:

2016-05:A 3
2016-05:E 4
2016-05:...
2016-05:other letters
2016-06:T 5
2016-06:...

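and then to find a way to pick out, for each month, the two letters with the highest counts. In fact, I don't know whether at this point there is a way to re-process the keys in order to extract the maximum values. Does anyone have a suggestion?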
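
One possible way to re-process the keys is a second pass over the first job's output, sketched here in plain Java so it runs without a cluster. It assumes each output line has the form "yyyy-MM:word", a tab, then the count (tab being the default key/value separator of TextOutputFormat). The class name and the rekey, topTwo and run methods are hypothetical stand-ins for a second job's mapper, reducer and shuffle, not Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SecondPassTopTwo {

    /** "Map" step: turns "yyyy-MM:word<TAB>count" into {month, "word:count"}. */
    static String[] rekey(String line) {
        String[] keyAndCount = line.split("\t");
        int sep = keyAndCount[0].indexOf(':');
        String month = keyAndCount[0].substring(0, sep);
        String word = keyAndCount[0].substring(sep + 1);
        return new String[] { month, word + ":" + keyAndCount[1].trim() };
    }

    /** Extracts the count from a "word:count" value. */
    static int count(String wordAndCount) {
        return Integer.parseInt(wordAndCount.substring(wordAndCount.lastIndexOf(':') + 1));
    }

    /** "Reduce" step: keeps the two values with the highest counts for one month. */
    static List<String> topTwo(Iterable<String> values) {
        String best = null;
        String second = null;
        for (String v : values) {
            if (best == null || count(v) > count(best)) {
                second = best;
                best = v;
            } else if (second == null || count(v) > count(second)) {
                second = v;
            }
        }
        List<String> out = new ArrayList<>();
        if (best != null) out.add(best);
        if (second != null) out.add(second);
        return out;
    }

    /** Stand-in for the shuffle: groups re-keyed values by month, then reduces each group. */
    static Map<String, List<String>> run(List<String> firstJobOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : firstJobOutput) {
            String[] kv = rekey(line);
            grouped.computeIfAbsent(kv[0], m -> new ArrayList<>()).add(kv[1]);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), topTwo(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> firstJobOutput = Arrays.asList(
                "2016-05:A\t3", "2016-05:B\t2", "2016-05:E\t4",
                "2016-06:A\t3", "2016-06:T\t5", "2016-06:E\t4");
        System.out.println(run(firstJobOutput));
    }
}
```

In an actual second job, rekey would correspond to a mapper emitting the month as key, and topTwo to a reducer that sees all "word:count" values of one month together. When counts tie, this sketch keeps whichever value it saw first.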

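The other solution I thought of exists only as pseudocode, and I don't know whether it is possible with the MapReduce framework: define a Text key, a List listValues, and a finalMap, that is, a Map whose values are themselves a map from String to Integer.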

mapper(key, value, context) {
   month = // retrieve using a StringTokenizer splitting on ','
   tmpKey = month;
   while (itr.hasMoreTokens()) {
        listValues.add(itr.nextToken())
   }
   key.set(tmpKey)
   context.write(key, listValues) // Here is my first doubt: is it possible to write something like context.write(Text, List<String>)?
 }

reduce(Text key, Iterable<List<String>> values, Context context) {
   Map<String, Integer> letterVal = new ...;
   for (List<String> listLetter : values) {
      for (String letter : listLetter) {
         if (letterVal.containsKey(letter)) {
             Integer tmpVal = letterVal.get(letter)
             letterVal.put(letter, tmpVal + 1);
         } else
             letterVal.put(letter, 1)
      }
   }
   finalMap.put(key, letterVal)
   context.write(key, finalMap.get(key).toString())
 }
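
On the doubt about writing a List<String>: as far as I know, Hadoop's default serialization expects map output values to implement Writable, which java.util.List does not. One workaround is to send the words of a line as a single delimited string (which would fit in a Text) and split it again in the reducer. The plain-Java sketch below illustrates that idea under this assumption. The class name and the map and reduce methods are hypothetical stand-ins rather than Hadoop classes, and every line is assumed to contain at least one word after the date.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class MonthKeyedSketch {

    /** Mapper side: returns {month, "A,B,C"}; the word list travels as one delimited string. */
    static String[] map(String line) {
        int firstComma = line.indexOf(',');
        String date = line.substring(0, firstComma).trim();
        String month = date.substring(0, date.lastIndexOf('-'));
        return new String[] { month, line.substring(firstComma + 1) };
    }

    /** Reducer side: counts words over all values of one month, returns the top n as "word:count". */
    static List<String> reduce(Iterable<String> values, int n) {
        Map<String, Integer> letterVal = new HashMap<>();
        for (String joined : values) {
            for (String letter : joined.split(",")) {
                letterVal.merge(letter.trim(), 1, Integer::sum);
            }
        }
        // Min-heap capped at n entries: the smallest of the current top n sits at the head.
        Comparator<Map.Entry<String, Integer>> byCount =
                (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<String, Integer> e : letterVal.entrySet()) {
            heap.add(e);
            if (heap.size() > n) {
                heap.poll(); // drop the entry with the lowest count
            }
        }
        List<String> top = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Integer> e = heap.poll();
            top.add(e.getKey() + ":" + e.getValue());
        }
        Collections.reverse(top); // highest count first
        return top;
    }

    public static void main(String[] args) {
        List<String> mayValues = Arrays.asList(
                map("2016-05-10,A,B,C,A,R,E,F,E")[1],
                map("2016-05-18,A,B,F,E,E")[1]);
        System.out.println("2016-05, " + reduce(mayValues, 2));
    }
}
```

Because the month alone is the key, a single reducer call sees every word of that month, so no second job is needed. The trade-off is that the reducer holds one counter per distinct word of a month in memory, and the order among words with equal counts is not defined here.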

Hadoop MapReduce Java: counting occurrences by date

0 answers: