MapReduce: stop words are not found

Date: 2016-05-10 16:28:16

Tags: java hadoop stop-words

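I'm new to MapReduce, and I'm trying to write a program that counts the number of stop words in a file. I pass my stopword.txt file on the command line, but every time I run the job the result is Stop Words = 0 and Good Words = 30 (it should be 5 and 25). I don't get any exceptions; the program compiles and runs fine. What else can I try? My code is below. The Hadoop version is 2.0.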

StopWord.java

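import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class StopWord {

    public enum COUNTERS {
        STOPWORDS, GOODWORDS
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Consume the generic Hadoop options and keep the application arguments.
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        Job job = Job.getInstance(conf, "StopWord");
        job.setOutputKeyClass(Text.class);
        // Matches the value type declared by MyMapper.
        job.setOutputValueClass(NullWritable.class);
        job.setJarByClass(StopWord.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // "-skip <file>" names the stop-word file; everything else is positional.
        List<String> otherArgs = new ArrayList<String>();
        for (int i = 0; i < args.length; i++) {
            if ("-skip".equals(args[i]) && i + 1 < args.length) {
                DistributedCache.addCacheFile(new Path(args[++i]).toUri(),
                        job.getConfiguration());
            } else {
                otherArgs.add(args[i]);
            }
        }

        FileInputFormat.setInputPaths(job, new Path(otherArgs.get(0)));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));
        boolean success = job.waitForCompletion(true);

        Counters counters = job.getCounters();
        System.out.printf("Good Words: %d, Stop Words: %d%n",
                counters.findCounter(COUNTERS.GOODWORDS).getValue(),
                counters.findCounter(COUNTERS.STOPWORDS).getValue());
        System.exit(success ? 0 : 1);
    }
}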

MyMapper.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Text word = new Text();
    private final Set<String> stopWordList = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // Local copies of the cached files; null when nothing was cached.
            Path[] stopWordFiles = context.getLocalCacheFiles();
            // Arrays.toString prints the paths and tolerates a null array.
            System.out.println(Arrays.toString(stopWordFiles));
            if (stopWordFiles != null && stopWordFiles.length > 0) {
                for (Path stopWordFile : stopWordFiles) {
                    readStopWordFile(stopWordFile);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception reading stop word file: " + e);
        }
    }

    // Reads the stop word file, one word per line, into stopWordList.
    private void readStopWordFile(Path stopWordFile) {
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader =
                new BufferedReader(new FileReader(stopWordFile.toString()))) {
            String stopWord;
            while ((stopWord = reader.readLine()) != null) {
                stopWordList.add(stopWord);
            }
        } catch (IOException e) {
            System.err.println("Exception while reading stop word file '"
                    + stopWordFile + "' : " + e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());

        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (stopWordList.contains(token)) {
                context.getCounter(StopWord.COUNTERS.STOPWORDS).increment(1);
            } else {
                context.getCounter(StopWord.COUNTERS.GOODWORDS).increment(1);
                word.set(token);
                // Emit only the good words; the value carries no data.
                context.write(word, NullWritable.get());
            }
        }
    }
}

1 Answer:

Answer 0 (score: 0)

From what I can see, your stopWordFiles may be empty: you add the file to the distributed cache after the job has been initialized.
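
One way to act on this is to parse the command line completely before the job is created, so that every -skip file is already known and can be registered first. The sketch below shows only that parsing step in plain Java, with no Hadoop dependency. `SkipArgs`, `ParsedArgs` and `parse` are hypothetical names chosen for illustration, and the registration call itself is left as a comment.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipArgs {

    /** Hypothetical holder: files given via -skip, plus the remaining positional arguments. */
    static final class ParsedArgs {
        final List<String> skipFiles = new ArrayList<>();
        final List<String> positional = new ArrayList<>();
    }

    /** Separates "-skip <file>" pairs from positional arguments such as input and output paths. */
    static ParsedArgs parse(String[] args) {
        ParsedArgs parsed = new ParsedArgs();
        for (int i = 0; i < args.length; i++) {
            if ("-skip".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-skip requires a file argument");
                }
                parsed.skipFiles.add(args[++i]);
            } else {
                parsed.positional.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParsedArgs parsed = parse(new String[] {"-skip", "stopword.txt", "input", "output"});
        // In the driver, each entry of skipFiles would be added to the
        // distributed cache at this point, before the Job object is built.
        System.out.println("skip files: " + parsed.skipFiles);
        System.out.println("positional: " + parsed.positional);
    }
}
```

Parsing first also makes a missing file argument fail with a clear message instead of an array index error.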

See this post for more information: Basic Types - Kotlin Programming Language
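
To check whether the stop-word set really is empty inside the job, the matching logic can be exercised outside Hadoop first. This is a minimal sketch, assuming one stop word per line and whitespace-separated tokens as in the mapper from the question. `StopWordCheck`, `loadStopWords` and `count` are hypothetical names, and the sample data in `main` is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordCheck {

    /** Reads one stop word per line, trimming whitespace and skipping blank lines. */
    static Set<String> loadStopWords(Reader source) {
        Set<String> stopWords = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    stopWords.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stopWords;
    }

    /** Returns {stopWordCount, goodWordCount} for the whitespace-separated tokens in text. */
    static long[] count(String text, Set<String> stopWords) {
        long stop = 0;
        long good = 0;
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            if (stopWords.contains(tokenizer.nextToken())) {
                stop++;
            } else {
                good++;
            }
        }
        return new long[] {stop, good};
    }

    public static void main(String[] args) {
        // Made-up sample data; swap in FileReader instances for the real files.
        Set<String> stopWords = loadStopWords(new StringReader("the\na\nof\n"));
        long[] counts = count("the cat sat on a mat of straw", stopWords);
        System.out.println("Stop Words = " + counts[0] + ", Good Words = " + counts[1]);
    }
}
```

If this check reports the expected counts for the same files while the job still reports Stop Words = 0, that suggests the file is not reaching the mapper rather than a problem in the matching itself.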