Question

I am trying to read a file in the main method of a Hadoop job, not in the mapper or reducer. I am using Amazon EMR with a custom JAR.

The command-line arguments are: -files s3://[path]#source.xml

In the main function I am doing:

File file = new File("source.xml");

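I don't know whether the distributed cache is available in the main function or only in the mapper/reducer functions. Do I need to use the DistributedCache API?

The command line that AWS is executing: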

hadoop jar /mnt/var/lib/hadoop/steps/s-1YBXTPYJ2YK44/JobTeste_SomenteLeitura.jar -files s3://stoneagebrasil/TesteBVS/sources.xml

How can I do this?

Answer 1

Try this:

// FileSystem.get(configuration) returns the cluster's default file system.
FileSystem fs = FileSystem.get(configuration);
Path path = new Path("test.txt");

Then read the file:

// try-with-resources closes the reader (and the underlying stream) when done.
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}

Answer 2

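So far, I have found that it is not possible to read a file from the distributed cache in the Hadoop driver (the main function). This is because the files are distributed (copied to the slave nodes) only after the job has been launched.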
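
To illustrate what the -files argument from the question carries: the part after # is only the link name that tasks see in their working directory, while the part before it is the actual file location. Below is a minimal sketch using only java.net.URI. The class name FilesArg and the bucket name are hypothetical, and it assumes the driver receives the raw argument string.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class FilesArg {

    // Returns the file location, i.e. the URI without its "#fragment" part.
    static URI location(String filesArg) throws URISyntaxException {
        URI u = new URI(filesArg);
        return new URI(u.getScheme(), u.getAuthority(), u.getPath(), u.getQuery(), null);
    }

    // Returns the link name given after '#', or null if there is none.
    static String linkName(String filesArg) throws URISyntaxException {
        return new URI(filesArg).getFragment();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical bucket and key, shaped like the argument in the question.
        String arg = "s3://my-bucket/path/sources.xml#source.xml";
        System.out.println(location(arg));
        System.out.println(linkName(arg));
    }
}
```

The driver never gets the link, so opening "source.xml" as a local file there fails, but the location part is still usable from the main function.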

The solution is to read the file directly from S3.
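
A rough sketch of what that could look like in the driver. To keep it self-contained, the reading logic accepts any InputStream and a byte array stands in for the S3 stream. The comment shows how the stream would be obtained through Hadoop's FileSystem API, assuming Hadoop is on the classpath. The class name DriverFileReader and the bucket and key are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class DriverFileReader {

    // Reads every line from the stream as UTF-8 and closes the stream afterwards.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // In a real driver (assumption: Hadoop on the classpath, hypothetical bucket/key):
        //   Path p = new Path("s3://my-bucket/path/source.xml");
        //   FileSystem fs = p.getFileSystem(configuration);
        //   InputStream in = fs.open(p);
        // A byte array stands in for the S3 stream here.
        InputStream in = new ByteArrayInputStream(
                "<a/>\n<b/>".getBytes(StandardCharsets.UTF_8));
        for (String line : readLines(in)) {
            System.out.println(line);
        }
    }
}
```

Resolving the file system from the path itself (p.getFileSystem) rather than from the default configuration matters here, because the default file system on a cluster is usually HDFS, not S3.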

Reading a file inside the main function - Hadoop
