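I am new to Hadoop MapReduce and I am working on a word count. The input is a number of text files; I have to report the frequency of each word in each individual file and build a term vector table from those counts. I have read that term vectors are part of the Apache Lucene library. I am completely new to all of this, so I do not know how to go about it.
How should I do this? Thanks.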
The output should look like the following table:
        an   apple   is   not   orange   the
Doc1     1       5    8    22        0    32
Doc2     0       6   10    19        0    13
Doc3     3      12   15     4        8     5
This is my mapper class:
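import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class Mapper2 extends Mapper<LongWritable, Text, Text, IntWritable>
{
    // Words that are skipped; built once instead of on every map() call.
    // The match is case-sensitive, exactly as in the original list.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a","about","above","after","again","against","all","am","an",
        "and","any","are","as","at","be","by","com","for","from","how","in","it","of",
        "on","or","that","the","this","to","was","what","when","where","who","will","with",
        "is","do","not","I","This"));
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        // Get the name of the file this input split belongs to
        FileSplit split = (FileSplit) context.getInputSplit();
        String filename = split.getPath().getName();
        // context.write(new Text(filename), new IntWritable(1));
        String s = value.toString();
        for (String word : s.split("\\W+"))
        {
            if ((word.length() > 0) && (!STOPWORDS.contains(word)))
            {
                // Key is "filename<TAB>word"; the tab keeps the two parts
                // separable later (plain filename+word is ambiguous).
                context.write(new Text(filename + "\t" + word), ONE);
            }
        }
    }
}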
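
To make the target concrete, here is a minimal plain-Java sketch of the step that is still missing: summing the (file, word) pairs and pivoting them into one row per document. It deliberately does not use Hadoop or Lucene, so it runs locally; in a real job the summing would presumably happen in a reducer. The class name `TermVectorSketch`, the methods `countPerDocument` and `toTable`, and the two sample documents are all hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop or Lucene.
final class TermVectorSketch {

    // Builds: document name -> (word -> frequency in that document).
    public static Map<String, Map<String, Integer>> countPerDocument(Map<String, String> docs) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Map<String, Integer> perDoc = new TreeMap<>();
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    perDoc.merge(word, 1, Integer::sum); // add 1 to the running count
                }
            }
            counts.put(doc.getKey(), perDoc);
        }
        return counts;
    }

    // Pivots the counts into tab-separated rows: a header row listing every
    // term, then one row per document, with 0 for terms the document lacks.
    public static List<String> toTable(Map<String, Map<String, Integer>> counts) {
        TreeSet<String> vocabulary = new TreeSet<>();
        for (Map<String, Integer> perDoc : counts.values()) {
            vocabulary.addAll(perDoc.keySet());
        }
        List<String> rows = new ArrayList<>();
        rows.add("\t" + String.join("\t", vocabulary));
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            StringBuilder row = new StringBuilder(entry.getKey());
            for (String term : vocabulary) {
                row.append('\t').append(entry.getValue().getOrDefault(term, 0));
            }
            rows.add(row.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("Doc1", "the apple is not the orange"); // made-up sample text
        docs.put("Doc2", "an apple an orange");
        for (String row : toTable(countPerDocument(docs))) {
            System.out.println(row);
        }
    }
}
```

The sketch keeps every term, including stop words, so that it matches the example table above; filtering with the stop word list from the mapper could be added inside the inner loop of `countPerDocument`. Sorted maps are used only so that the column and row order is deterministic.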