How to use a file (containing the full paths of many files) as input to a MapReduce job

Date: 2015-11-11 15:59:04

Tags: java hadoop mapreduce

I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple MapReduce job where I provide a folder as input.

However, I want to use a file as input (the full paths are inside it); this file lists all the other files to be processed by the mapper function.

Below is the file content:

/allfiles.txt
    - /tmp/aaa/file1.txt
    - /tmp/bbb/file2.txt
    - /tmp/ccc/file3.txt

How can I specify the input path to the MapReduce program as a file, so that it starts processing each file listed inside? Thanks.

2 Answers:

Answer 0 (score: 0)

In your driver class, you can read in that file and add each line as an input path:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

//Read allfiles.txt and put each line into a List (requires at least Java 7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);

//Loop through the file names and add each one as an input path
for (String file : files) {
    //This Path is org.apache.hadoop.fs.Path; 'job' is your org.apache.hadoop.mapreduce.Job
    FileInputFormat.addInputPath(job, new Path(file));
}

This assumes that your allfiles.txt is local to the node the MR job runs on, but it is only a small change if allfiles.txt is actually on HDFS.
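
For the HDFS case, a minimal sketch (assuming allfiles.txt sits at /allfiles.txt on HDFS, as in the question, and 'job' is your org.apache.hadoop.mapreduce.Job) could read the list through the FileSystem API instead of java.nio:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

FileSystem fs = FileSystem.get(job.getConfiguration());

//Open /allfiles.txt from HDFS and add each listed path as an input
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/allfiles.txt"))))) {
    String line;
    while ((line = reader.readLine()) != null) {
        FileInputFormat.addInputPath(job, new Path(line.trim()));
    }
}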

I would strongly recommend that you verify each file exists before adding it as input.
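
With the same FileSystem handle, that check is a one-line guard; inside the loop above, a sketch might replace the addInputPath call with:

//Only add paths that actually exist on HDFS; skip (and log) the rest
Path p = new Path(line.trim());
if (fs.exists(p)) {
    FileInputFormat.addInputPath(job, p);
} else {
    System.err.println("Skipping missing input file: " + p);
}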

Answer 1 (score: 0)

Instead of creating a file that contains the paths of the other files, you can use globs.

In your example, you could define your input as -input /tmp/*/file?.txt
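
The -input flag is Hadoop Streaming syntax; in a Java driver (as used here), a sketch under the same assumptions would pass the glob straight to FileInputFormat, which expands it when listing input splits:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

//Matches /tmp/aaa/file1.txt, /tmp/bbb/file2.txt, /tmp/ccc/file3.txt, etc.
FileInputFormat.addInputPath(job, new Path("/tmp/*/file?.txt"));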