Question

I have a MapReduce program that runs correctly locally.

It uses a file named new-positions.csv in the setup() method of the mapper class to populate an in-memory hash table:

public void setup(Context context) throws IOException,  InterruptedException {
        newPositions = new Hashtable<String, Integer>();
        File file = new File("new-positions.csv");

        Scanner inputStream = new Scanner(file);
        String line = null;
        String firstline = inputStream.nextLine();
        while(inputStream.hasNext()){
            line = inputStream.nextLine();
            String[] splitLine = line.split(",");
            Integer id = Integer.valueOf(splitLine[0].trim());
            // String firstname = splitLine[1].trim();
            // String surname = splitLine[2].trim();
            String[] emails = new String[4];
            for (int i = 3; i < 7; i++) {
                emails[i-3] = splitLine[i].trim();
            }
            for (String email : emails) {
                if (!email.equals("")) newPositions.put(email, id);
            }
            // String position = splitLine[7].trim();
        }
        // Close the scanner only after every line has been read.
        inputStream.close();
    }
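The parsing step in setup() can be checked on its own, without a cluster. The following is a minimal sketch, assuming the column layout implied by the code above (id, first name, surname, four email columns, position). The class name PositionsParser and the choice to skip short rows are assumptions made for illustration, not part of the original program.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper: PositionsParser is not part of the original program.
final class PositionsParser {

    // Reads rows of the form id,firstname,surname,email1..email4,position
    // and maps every non-empty email address to the id of its row.
    static Map<String, Integer> parse(Readable source) {
        Map<String, Integer> emailToId = new HashMap<>();
        // try-with-resources closes the scanner once, after all lines are read.
        try (Scanner in = new Scanner(source)) {
            if (in.hasNextLine()) {
                in.nextLine(); // skip the header row
            }
            while (in.hasNextLine()) {
                // A limit of -1 keeps trailing empty columns.
                String[] cols = in.nextLine().split(",", -1);
                if (cols.length < 7) {
                    continue; // assumption: rows that are too short are skipped
                }
                Integer id = Integer.valueOf(cols[0].trim());
                for (int i = 3; i < 7; i++) {
                    String email = cols[i].trim();
                    if (!email.isEmpty()) {
                        emailToId.put(email, id);
                    }
                }
            }
        }
        return emailToId;
    }

    public static void main(String[] args) {
        String csv = "id,first,last,e1,e2,e3,e4,position\n"
                   + "7,Ann,Lee,ann@x.org,,a.lee@x.org,,Analyst\n";
        System.out.println(parse(new java.io.StringReader(csv)));
    }
}
```

Because parse() accepts any Readable, the same method could be fed a FileReader inside setup() and a StringReader in a unit test.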

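The Java program has been exported as an executable JAR. The JAR and full-positions.csv are both saved in the same directory on our local file system.

Then, from inside that directory, we execute the following command in the terminal (we also tried using the fully qualified path name for new-positions.csv):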

hadoop jar MR2.jar Reader2 -files new-positions.csv InputDataset OutputFolder

It executes fine, but when it gets to the mapper we get:

Error: java.io.FileNotFoundException: new-positions.csv (No such file or directory)

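This file definitely exists locally, and we are definitely executing from within that directory.

We are following the guidance given in Hadoop: The Definitive Guide (4th ed.), p. 274 onwards, and cannot see how our program and arguments differ in structure.

Could this have something to do with the Hadoop configuration? We know there are workarounds, such as copying the file to HDFS and then executing from there, but we need to understand why this "-files" argument is not working as expected.

EDIT: below is some code from the driver class, which may also be the source of the problem: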

 public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
     if (args.length != 5) {
         printUsage(this, "");
         return 1;
     }

     Configuration config = getConf();

     FileSystem fs = FileSystem.get(config);

     Job job = Job.getInstance(config);
     job.setJarByClass(this.getClass());
     FileInputFormat.addInputPath(job, new Path(args[3]));

     // Delete old output if necessary
     Path outPath = new Path(args[4]);
     if (fs.exists(outPath)) 
         fs.delete(outPath, true);

     FileOutputFormat.setOutputPath(job, new Path(args[4]));

     job.setInputFormatClass(SequenceFileInputFormat.class);

     job.setOutputKeyClass(NullWritable.class);
     job.setOutputValueClass(Text.class);

     job.setMapOutputKeyClass(EdgeWritable.class);
     job.setMapOutputValueClass(NullWritable.class);

     job.setMapperClass(MailReaderMapper.class);
     job.setReducerClass(MailReaderReducer.class);

     job.setJar("MR2.jar");


     boolean status = job.waitForCompletion(true);
     return status ? 0 : 1;
 }

 public static void main(String[] args) throws Exception {
     int exitCode = ToolRunner.run(new Reader2(), args);
     System.exit(exitCode);
 }
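One thing worth checking in the driver is how the arguments arrive in run(). Hadoop's documented command form places generic options such as -files before the application's own arguments. The check args.length != 5 together with args[3] and args[4] would be consistent with run() receiving Reader2, -files and new-positions.csv as ordinary arguments, although the question does not confirm this. The sketch below is a simplified model written for illustration only; it is not Hadoop's GenericOptionsParser, and the class and method names are hypothetical. It shows how a parser that only consumes leading options treats the two orderings differently.

```java
import java.util.Arrays;

// Simplified model for illustration only; this is NOT Hadoop's GenericOptionsParser.
final class LeadingOptionsModel {

    // Consumes leading "-files <value>" pairs and returns what is left.
    // Parsing stops at the first token that is not a recognised option.
    static String[] remainingArgs(String[] args) {
        int i = 0;
        while (i + 1 < args.length && args[i].equals("-files")) {
            i += 2; // skip the option and its value
        }
        return Arrays.copyOfRange(args, i, args.length);
    }

    public static void main(String[] unused) {
        String[] optionsFirst =
            {"-files", "new-positions.csv", "InputDataset", "OutputFolder"};
        String[] classNameFirst =
            {"Reader2", "-files", "new-positions.csv", "InputDataset", "OutputFolder"};
        // Options lead: they are consumed and two application arguments remain.
        System.out.println(remainingArgs(optionsFirst).length);   // prints 2
        // A non-option leads: nothing is consumed and all five tokens remain.
        System.out.println(remainingArgs(classNameFirst).length); // prints 5
    }
}
```

Under this model, a driver that sees five arguments never had its -files option interpreted. Logging Arrays.toString(args) at the top of run() would show which case applies on a real cluster.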

Answer 1

Assuming your "new-positions.csv" is in the folder H:/HDP/, you need to pass this file as:

file:///H:/HDP/new-positions.csv

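You need to qualify the path with file:/// to indicate that it is a local file system path. You also need to pass the fully qualified path.

This works for me.

For example, I pass the local file myini.ini as follows: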

yarn jar hadoop-mapreduce-examples-2.4.0.2.1.5.0-2060.jar teragen -files "file:///H:/HDP/hadoop-2.4.0.2.1.5.0-2060/share/hadoop/common/myini.ini" -Dmapreduce.job.maps=10 10737418 /usr/teraout/
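A fully qualified file: URI does not have to be typed by hand. As a minimal sketch, assuming the file sits relative to the directory the command is launched from, the standard java.nio.file API can derive it. The class name LocalFileUri is hypothetical.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of Hadoop or the original program.
final class LocalFileUri {

    // Resolves a possibly relative local path to an absolute file: URI.
    static URI toFileUri(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        return absolute.toUri();
    }

    public static void main(String[] args) {
        // Prints a file: URI for new-positions.csv under the current directory.
        System.out.println(toFileUri("new-positions.csv"));
    }
}
```

The printed value can then be passed as the argument to -files.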

Answer 2

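I think Manjunath Ballur gave the correct answer, but the URI you passed, file:///home/local/xxx360/FinalProject/new-positions.csv, may not be resolvable from the Hadoop worker machines.

The path looks like an absolute path on a machine, but which machine contains home? Add the server to the path and I think it might work.

Alternatively, if you use the singular -file, it looks like Hadoop copies the file rather than creating a symlink as -files does.

See the documentation here.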
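If -files does create a symlink as described, the file should be visible under its plain name in the task's working directory. A small probe, called from setup() before the file is opened, could log what the task actually sees. This is a sketch using only java.nio.file; the class name WorkingDirProbe and the wording of the messages are assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Hadoop or the original program.
final class WorkingDirProbe {

    // Reports whether fileName is a symlink, another kind of entry, or missing
    // in the current working directory of the running JVM.
    static String describe(String fileName) {
        Path cwd = Paths.get("").toAbsolutePath();
        Path candidate = cwd.resolve(fileName);
        String state;
        if (Files.isSymbolicLink(candidate)) {
            // Files.exists follows the link, so this also detects a dangling link.
            state = "symlink, target exists: " + Files.exists(candidate);
        } else if (Files.exists(candidate)) {
            state = "present (not a symlink)";
        } else {
            state = "missing";
        }
        return fileName + " in " + cwd + ": " + state;
    }

    public static void main(String[] args) {
        System.out.println(describe("new-positions.csv"));
    }
}
```

Writing the returned string to the task log from setup() makes it possible to tell a file that was never shipped apart from one that was shipped under a different name.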

Answer 3

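I don't see anything wrong in your code. In working code that is technically identical to yours, I also got java.io.FileNotFoundException when I added a - to the file name. Remove the - and try again: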

        File file = new File("newpositions.csv");

hadoop jar MR2.jar Reader2 -files newpositions.csv InputDataset OutputFolder

Passing files to Hadoop using the -files argument