Working with a user-supplied input string in MapReduce

Time: 2015-04-20 14:49:04

Tags: java string hadoop input mapreduce

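I am just starting out with Hadoop's implementation of MapReduce, so I have no idea how input and output are handled. I do understand how it works conceptually.

My problem is finding a specific search string in a set of files that I supply. I am not concerned about the files themselves; that part is already sorted out. But how would you go about asking for the input? Would you ask for it in the JobConf part of the program? If so, how would I pass the string to the job?

If it goes in the map() function, how would you implement it? Wouldn't it just ask for the search string every time map() is called?

Here are the main method and the JobConf() part, to give you an idea: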

public static void main(String[] args) throws IOException {

    // This produces an output file in which each line contains a separate word followed by
    // the total number of occurrences of that word in all the input files.

    JobConf job = new JobConf();

    FileInputFormat.setInputPaths(job, new Path("input"));
    FileOutputFormat.setOutputPath(job, new Path("output"));

    // Output from reducer maps words to counts.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    // The output of the mapper is a map from words (including duplicates) to the value 1.
    job.setMapperClass(InputMapper.class);

    // The output of the reducer is a map from unique words to their total counts.
    job.setReducerClass(CountWordsReducer.class);

    JobClient.runJob(job);
}

The map() function:

public void map(LongWritable key, Text value, OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {

    // The key is the character offset within the file of the start of the line, ignored.
    // The value is a line from the file.

    //This is me trying to hard-code it. I would prefer an explanation on how to get interactive input!
    String inputString = "data"; 
    String line = value.toString();
    Scanner scanner = new Scanner(line);

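    // Test the line once, then emit each of its tokens; the scanner has to advance
    // on every iteration or the loop never terminates.
    if (line.contains(inputString)) {
        while (scanner.hasNext()) {
            String line1 = scanner.next();
            output.collect(new Text(line1), new LongWritable(1));
        }
    }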
    scanner.close();
}

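I have been led to believe that I do not need a reducer phase to solve this. Any advice or explanation is much appreciated!

1 answer:

Answer 0 (score: 2)

The JobConf class extends the Configuration class, so you can set custom properties on it: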

JobConf job = new JobConf();
job.set("inputString", "data");
...

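Then, as stated in the Mapper documentation: Mapper implementations can access the JobConf for the job via JobConfigurable.configure(JobConf) and initialize themselves. This means you have to override that method in your Mapper to read the parameter you need: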

private static String inputString;

public void configure(JobConf job) {
    inputString = job.get("inputString");
}
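
On the worry about being asked on every map() call: the framework calls configure once per map task, before the first record, and then calls map() once per input line. The sketch below only mimics that order in plain Java so it can be run without a cluster. It uses no Hadoop types; SearchMapperSketch, LifecycleDemo, runTask and the Map standing in for JobConf are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mapper; not a Hadoop class.
class SearchMapperSketch {
    private String inputString;   // set once, read on every map() call
    int configureCalls = 0;       // counted only to make the lifecycle visible

    // Plays the role of configure(JobConf): runs once per task.
    void configure(Map<String, String> jobProperties) {
        configureCalls++;
        inputString = jobProperties.get("inputString");
    }

    // Plays the role of map(): runs once per input line.
    void map(String line, List<String> output) {
        if (line.contains(inputString)) {
            output.add(line);
        }
    }
}

class LifecycleDemo {
    // Mimics what the framework does inside a single map task.
    static List<String> runTask(SearchMapperSketch mapper,
                                Map<String, String> jobProperties,
                                List<String> lines) {
        mapper.configure(jobProperties);      // once, up front
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            mapper.map(line, output);         // once per record
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> jobProperties = new HashMap<>();
        jobProperties.put("inputString", "data");

        SearchMapperSketch mapper = new SearchMapperSketch();
        List<String> matches = runTask(mapper, jobProperties,
                Arrays.asList("some data here", "nothing", "more data"));

        System.out.println(matches + ", configure calls: " + mapper.configureCalls);
    }
}
```

So the search string is read a single time per task and every map() call just reuses the field.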

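In any case, this uses the old API. With the new API the configuration is easier to reach, because the context (and with it the configuration) is passed as an argument to the map method.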
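
For comparison, here is a minimal sketch of the same idea with the new API (org.apache.hadoop.mapreduce), assuming Hadoop 2.x or later on the classpath. The class names SearchJob and SearchMapper, the job name, the fallback value "data" and the choice to take the search string from the first command-line argument are my own assumptions, not something from the question. It is set up as a map-only job, since the question suggests no reducer is needed.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SearchJob {

    // Emits every input line that contains the search string.
    public static class SearchMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private String inputString;

        @Override
        protected void setup(Context context) {
            // Runs once per task, before any map() call.
            inputString = context.getConfiguration().get("inputString", "data");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key is the byte offset of the line; value is the line itself.
            if (value.toString().contains(inputString)) {
                context.write(value, key);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the property before creating the Job, because the Job copies conf.
        conf.set("inputString", args.length > 0 ? args[0] : "data");

        Job job = Job.getInstance(conf, "string search");
        job.setJarByClass(SearchJob.class);
        job.setMapperClass(SearchMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no reducer phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reading the property in setup() rather than in map() keeps the lookup to once per task, and passing the string on the command line means the driver never has to prompt interactively.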