I'm using Hadoop 2.7.1 and coding in Java. I'm able to run a simple MapReduce program where I provide a folder as input to the MapReduce program.
However, I want to use a file as input instead: this file contains the full paths of all the other files to be processed by the mapper function.
Below is the file content,
/allfiles.txt
- /tmp/aaa/file1.txt
- /tmp/bbb/file2.txt
- /tmp/ccc/file3.txt
How can I specify the input path to the MapReduce program as a file, so that it starts processing each file listed inside? Thanks.
Answer 0 (score: 0)
In your driver class, you can read that file in and add each line as an input path:
// Needs: java.nio.charset.StandardCharsets, java.nio.file.Files,
// java.nio.file.Paths, java.util.List

// Read allfiles.txt and put each line into a List (requires at least Java 1.7)
List<String> files = Files.readAllLines(Paths.get("allfiles.txt"), StandardCharsets.UTF_8);

// Loop through the file names and add them as input
for (String file : files) {
    // This Path is org.apache.hadoop.fs.Path
    FileInputFormat.addInputPath(conf, new Path(file));
}
This assumes your allfiles.txt is local to the node running the MR job; if allfiles.txt is actually on HDFS, it's only a small change.
I would strongly recommend checking that each file exists before adding it as HDFS input.
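The read-and-filter step can be sketched in plain Java for the local-file case (this is only an illustration: `listExisting` is a hypothetical helper name, the Hadoop call is left as a comment since it needs the Hadoop classpath, and an HDFS version would use `FileSystem.exists` instead of `Files.exists`):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class InputLister {

    // Read a list file and keep only the paths that actually exist on local disk.
    static List<String> listExisting(Path listFile) throws IOException {
        List<String> existing = new ArrayList<>();
        for (String line : Files.readAllLines(listFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && Files.exists(Paths.get(trimmed))) {
                existing.add(trimmed);
            }
        }
        return existing;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway allfiles.txt pointing at one real and one missing file.
        Path dir = Files.createTempDirectory("mr-demo");
        Path real = Files.write(dir.resolve("file1.txt"),
                "data".getBytes(StandardCharsets.UTF_8));
        Path list = Files.write(dir.resolve("allfiles.txt"),
                (real + "\n" + dir.resolve("missing.txt") + "\n")
                        .getBytes(StandardCharsets.UTF_8));

        for (String file : listExisting(list)) {
            // In the real driver you would add each surviving path here:
            // FileInputFormat.addInputPath(conf, new org.apache.hadoop.fs.Path(file));
            System.out.println(file);
        }
    }
}
```

Only the path that exists survives the filter, so a stale entry in allfiles.txt no longer aborts the job at submit time.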
Answer 1 (score: 0)
Instead of creating a file containing the paths of the other files, you could use globs.
In your example, you could define the input as -input /tmp/*/file?.txt
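Hadoop's input globs behave much like shell globs, so a quick way to sanity-check a pattern before submitting a job is java.nio's PathMatcher (this only demonstrates the matching rules locally; it does not touch Hadoop, whose glob handling may differ in edge cases):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobCheck {
    public static void main(String[] args) {
        // Same pattern as the proposed -input argument.
        PathMatcher m = FileSystems.getDefault()
                .getPathMatcher("glob:/tmp/*/file?.txt");

        System.out.println(m.matches(Paths.get("/tmp/aaa/file1.txt"))); // true
        System.out.println(m.matches(Paths.get("/tmp/bbb/file2.txt"))); // true
        // false: the name doesn't fit file?.txt
        System.out.println(m.matches(Paths.get("/tmp/xxx/other.txt")));
        // false: * does not cross directory boundaries
        System.out.println(m.matches(Paths.get("/tmp/a/b/file1.txt")));
    }
}
```

Note that * stops at directory separators, so /tmp/*/file?.txt only matches files exactly one level below /tmp.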