Question

I am using Hadoop 1.0.3.

I write logs to a Hadoop sequence file in HDFS. I call syncFs() after each batch of logs, but I never close the file (except when I perform the daily roll).

What I want to guarantee is that the file is available to readers while it is still being written.

I can read the bytes of the sequence file through an FSDataInputStream, but if I try to use SequenceFile.Reader.next(key, val), it returns false on the first call.

I know the data is in the file, because I can read it with FSDataInputStream or with the cat command, and I am 100% sure that syncFs() is called.

I checked the namenode and datanode logs and found no errors or warnings.

Why is SequenceFile.Reader unable to read the file I am currently writing?

Answer 1

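You cannot ensure that what you read has been completely written to disk on the datanode side. You can see this in the documentation of DFSClient#DFSOutputStream.sync(), which states: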

  All data is written out to datanodes. It is not guaranteed that data has
  been flushed to persistent store on the datanode. Block allocations are
  persisted on namenode.

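So it basically updates the namenode's block map with the current information and sends the data to the datanode. Since you cannot flush the data to disk on the datanode, but you read directly from the datanode, you hit a time window in which the data is buffered somewhere and not accessible. Your sequence file reader will therefore think that the data stream is finished (or empty), cannot read any additional bytes, and returns false to the deserialization process.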

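A datanode writes the data to disk once the block has been fully received (it is written beforehand, but is not readable from outside). So you are able to read from the file once the block size has been reached, or once the file has been closed and the block has thereby been finalized. This makes sense in a distributed environment, because your writer can die without finishing a block properly. It is a matter of consistency.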

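So the fix would be to make the block size very small, so that blocks are finished more often. But that is not very efficient, and I hope it is clear that your requirement is not well suited to HDFS.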
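To get a feel for the trade-off, here is a rough back-of-the-envelope sketch in plain Java (no Hadoop involved). It assumes, as this answer does, that data only becomes readable when a block completes, and it assumes a constant write rate. The class name `VisibilityDelay` and the 100 KB/s rate are invented for illustration only.

```java
// Back-of-the-envelope model, not Hadoop code.
final class VisibilityDelay {

    // Worst-case seconds until freshly written data sits in a completed
    // block, assuming a constant write rate.
    static double secondsUntilVisible(long blockSizeBytes, long bytesPerSecond) {
        return (double) blockSizeBytes / bytesPerSecond;
    }

    // Number of blocks that one day of logs would occupy, rounded up.
    static long blocksPerDay(long blockSizeBytes, long bytesPerSecond) {
        long bytesPerDay = bytesPerSecond * 86_400L;
        return (bytesPerDay + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long rate = 100L * 1024;                  // hypothetical 100 KB/s of logs
        long[] blockSizes = {64L << 20, 1L << 20}; // 64 MB and 1 MB

        for (long blockSize : blockSizes) {
            System.out.println(String.format(
                "block size %d bytes: up to %.2f s before data is readable, %d blocks per day",
                blockSize,
                secondsUntilVisible(blockSize, rate),
                blocksPerDay(blockSize, rate)));
        }
    }
}
```

Under these assumptions a smaller block shortens the wait but multiplies the number of blocks the namenode has to track, which is the inefficiency mentioned above.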

Answer 2

The reason SequenceFile.Reader fails to read a file that is being written is that it uses the file length to perform its magic.

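The file length stays at 0 while the first block is being written, and is updated only when the block is full (64 MB by default). The file size then stays at 64 MB until the second block is fully written, and so on.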

This means you cannot read the last, incomplete block of a sequence file with SequenceFile.Reader, even though the raw data can be read directly with FSDataInputStream.

Closing the file also fixes the file length, but in my case I need to read the file before it is closed.
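To make the effect concrete, here is a minimal, self-contained Java sketch. It does not use Hadoop: `StaleLengthDemo`, `reportedLength`, and `readRecords` are hypothetical names that only model a reader which stops at a length counting completed blocks, as described in this answer.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Toy model, not Hadoop code: records are 4-byte ints stored back to back.
final class StaleLengthDemo {

    // Models the length described above: only completed blocks are counted.
    static long reportedLength(long bytesWritten, long blockSize) {
        return (bytesWritten / blockSize) * blockSize;
    }

    // Reads records until the next record would cross the given length
    // limit, the way a length-bounded reader would.
    static List<Integer> readRecords(byte[] data, long lengthLimit) {
        List<Integer> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.position() + Integer.BYTES <= lengthLimit
                && buf.remaining() >= Integer.BYTES) {
            records.add(buf.getInt());
        }
        return records;
    }

    public static void main(String[] args) {
        long blockSize = 64; // tiny block size so the demo stays small
        ByteBuffer out = ByteBuffer.allocate(40);
        for (int i = 0; i < 10; i++) {
            out.putInt(i); // 10 records = 40 bytes, less than one block
        }
        byte[] data = out.array();

        long stale = reportedLength(data.length, blockSize);
        System.out.println("reported length: " + stale);
        System.out.println("records read using reported length: "
                + readRecords(data, stale).size());
        System.out.println("records read using actual byte count: "
                + readRecords(data, data.length).size());
    }
}
```

The bytes are all present in `data`, yet the length-bounded read returns nothing until a full block has been written, which mirrors next(key, val) returning false on the first call.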

Answer 3

I ran into the same problem, and after some investigation and time I found that the following workaround works.

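The problem is caused by the internal implementation used when the sequence file reader is created, and by the fact that it relies on a file length that is only updated once per 64 MB block.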

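So I created the class below to create the reader. I wrapped the Hadoop FileSystem in my own class and overrode the getLength method so that it returns the real file length: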

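import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.Progressable;

public class SequenceFileUtil {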

    public SequenceFile.Reader createReader(Configuration conf, Path path) throws IOException {

        WrappedFileSystem fileSystem = new WrappedFileSystem(FileSystem.get(conf));

        return new SequenceFile.Reader(fileSystem, path, conf);
    }

    // Delegates every call to the wrapped FileSystem; only getLength() differs.
    private class WrappedFileSystem extends FileSystem
    {
        private final FileSystem nestedFs;

        public WrappedFileSystem(FileSystem fs){
            this.nestedFs = fs;
        }

        @Override
        public URI getUri() {
            return nestedFs.getUri();
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            return nestedFs.open(f,bufferSize);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException {
            return nestedFs.create(f, permission,overwrite,bufferSize, replication, blockSize, progress);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize, Progressable progress) throws IOException {
            return nestedFs.append(f, bufferSize, progress);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            return nestedFs.rename(src, dst);
        }

        @Override
        public boolean delete(Path path) throws IOException {
            return nestedFs.delete(path);
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            return nestedFs.delete(f, recursive);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException {
            return nestedFs.listStatus(f);
        }

        @Override
        public void setWorkingDirectory(Path new_dir) {
            nestedFs.setWorkingDirectory(new_dir);
        }

        @Override
        public Path getWorkingDirectory() {
            return nestedFs.getWorkingDirectory();
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            return nestedFs.mkdirs(f, permission);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            return nestedFs.getFileStatus(f);
        }


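        @Override
        public long getLength(Path f) throws IOException {
            // Length reported by the wrapped file system.
            long length = nestedFs.getLength(f);

            // Length reported by the DFS input stream, which may include
            // a block that has not been completed yet.
            DFSClient client = new DFSClient(nestedFs.getConf());
            try {
                DFSClient.DFSInputStream in = client.open(f.toUri().getPath());
                try {
                    long fileLength = in.getFileLength();
                    // We might have uncompleted blocks, so take the larger value.
                    return Math.max(length, fileLength);
                } finally {
                    in.close();
                }
            } finally {
                // Release the client opened only for this length lookup.
                client.close();
            }
        }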


    }
}

Answer 4

I ran into a similar problem, and this is how I fixed it: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201303.mbox/%3CCALtSBbY+LX6fiKutGsybS5oLXxZbVuN0WvW_a5JbExY98hJfig@mail.gmail.com%3E

Hadoop HDFS: reading a sequence file that is being written
