Hadoop MapReduce word count across multiple files

Time: 2014-04-19 14:33:33

Tags: hadoop mapreduce word-count

I am new to Hadoop MapReduce and I am working on word count. The input is a number of text files; I have to report the frequency of each word in each distinct file and form a term-vector table. I have read that term vectors are part of the Apache Lucene library. I am completely new to all of this, so I don't know how to proceed.

How should I do this? Thanks.

The output should look as follows, in table format:

        an  apple  is  not  orange  the
Doc1     1      5   8   22       0   32
Doc2     0      6  10   19       0   13
Doc3     3     12  15    4       8    5
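
For reference, a word-count job of this kind typically produces one output line per (document, word) pair, e.g. "Doc1@apple\t5". Turning those lines into the table above can be done in a small post-processing step. The sketch below is my own assumption (the "@"-delimited key format matches the mapper further down), not code from the original post:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Reads lines like "Doc1@apple\t5" (the assumed job output format)
// and pivots them into a document-by-term count table.
public class TermVectorTable {
    public static void main(String[] args) throws Exception {
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        Set<String> terms = new TreeSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] keyValue = line.split("\t");
                String[] docAndTerm = keyValue[0].split("@");
                terms.add(docAndTerm[1]);
                table.computeIfAbsent(docAndTerm[0], d -> new HashMap<>())
                     .put(docAndTerm[1], Integer.parseInt(keyValue[1]));
            }
        }
        // Print the header row, then one row of counts per document,
        // filling in 0 for terms a document never contained.
        System.out.println("\t" + String.join("\t", terms));
        for (Map.Entry<String, Map<String, Integer>> doc : table.entrySet()) {
            StringBuilder row = new StringBuilder(doc.getKey());
            for (String term : terms) {
                row.append("\t").append(doc.getValue().getOrDefault(term, 0));
            }
            System.out.println(row);
        }
    }
}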

Here is my mapper class:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Mapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final List<String> STOPWORDS = Arrays.asList(
            "a", "about", "above", "after", "again", "against", "all", "am", "an",
            "and", "any", "are", "as", "at", "be", "by", "com", "for", "from",
            "how", "in", "it", "of", "on", "or", "that", "the", "this", "to",
            "was", "what", "when", "where", "who", "will", "with", "is", "do",
            "not", "I", "This");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the name of the file this input split belongs to.
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().getName();

        // Emit one (filename@word, 1) pair per non-stopword token.
        // The "@" delimiter keeps the file name and the word separable
        // downstream; plain concatenation would make the key ambiguous.
        for (String word : value.toString().split("\\W+")) {
            if (word.length() > 0 && !STOPWORDS.contains(word)) {
                context.write(new Text(filename + "@" + word), new IntWritable(1));
            }
        }
    }
}
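
The mapper only emits a count of 1 per occurrence; a reducer is still needed to sum those counts for each (filename, word) key. A minimal sketch, assuming the "@"-delimited composite key used above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-occurrence counts emitted by Mapper2 into one total
// count per (filename, word) composite key.
public class Reducer2 extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

With the default TextOutputFormat, this writes lines like "Doc1@apple\t5", which a post-processing step (such as the pivot sketch above) can then turn into the document-by-term table.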

0 Answers:

No answers yet