Explaining WordCount in Hadoop

Asked: 2014-04-01 17:49:37

Tags: java hadoop word-count

I would like to know the meaning of the following lines. I am new to Java, and this is part of my assignment.

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

    //According to my knowledge, we are using this to get the line as a string
    String line = value.toString();

    //each string is now divided into individual words
    StringTokenizer tokenizer = new StringTokenizer(line);

    //How are we setting the end limit of the loop?
    while (tokenizer.hasMoreTokens()) {
        //what is the word.set operation doing here?
        word.set(tokenizer.nextToken());

        //What is context? And how are we giving the output to the reducer?
        context.write(word, one);
    }
}

1 Answer:

Answer 0 (score: 3)

Hope this clears things up.

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

    // Text is Hadoop's serializable string type, which is what makes it suitable
    // for distributed processing; toString() gives us a plain Java String that is
    // easier to work with.
    String line = value.toString();

    // A tokenizer tokenizes (or divides) a sentence into individual words.
    // StringTokenizer is not formally deprecated, but it is a legacy class whose
    // use is discouraged in new code, so line.split() with a whitespace regex
    // could be used instead:
    // String[] tokens = line.split("\\s+");
    StringTokenizer tokenizer = new StringTokenizer(line);

    // hasMoreTokens() returns a boolean (true or false) depending on whether the
    // tokenizer still has tokens (words) left. If split() is used, we can use a
    // for loop instead:
    // for (String token : tokens) {
    //     word.set(token);
    while (tokenizer.hasMoreTokens()) {
        // word is presumably of the Text type. Since, as I said above, the Text
        // type is more suitable for distributed computing, we convert the String
        // token we have into Text. The word variable has to be defined somewhere
        // in the class, though (see the sketch after this code).
        word.set(tokenizer.nextToken());

        // Context is something which lets you pass key-value pairs forward. Once
        // you write them using the Context object, the shuffle is performed;
        // after the shuffle they are grouped by key, and each key along with its
        // values is passed to the reducer. Note that the write belongs inside the
        // loop: if it were placed after it, only the last word of each line would
        // be emitted.
        context.write(word, one);
    }
}
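For completeness, here is a minimal sketch of how the surrounding class might look. It follows the standard org.apache.hadoop.mapreduce WordCount pattern; the names WordCount, TokenizerMapper, and IntSumReducer are my own choices, but the word and one fields match the ones used in the snippet, and the reducer shows what happens to the key-value pairs after the shuffle.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        // The two fields the snippet refers to: a constant count of 1 and a
        // reusable Text object holding the current word.
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        // After the shuffle, reduce() is called once per distinct word with all
        // the 1s that were emitted for it; summing them gives the word count.
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}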