Question

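I am using the Java API to convert some of my custom files into Hadoop SequenceFiles.

I read byte arrays from a local file and append them to the sequence file as index (Integer) / data (byte[]) pairs: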

InputStream in = new BufferedInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory),conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/"+ "data.seq");

IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    byte[] imageData = new byte[nx * ny * 2];
    in.read(imageData);

    key.set(i);
    value.set(imageData, 0, imageData.length);
    writer.append(key, value);
}
IOUtils.closeStream(writer);
in.close();

When I want to restore the files to their original format, I do the reverse:

    for (int i = 1; i <= nz; i++) {
        reader.next(key, value);
        int byteLength = value.getLength();
        byte[] tempValue = value.getBytes();
        out.write(tempValue, 0, byteLength);
        out.flush();
    }

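I noticed that writing the SequenceFile takes several orders of magnitude longer than reading it. I expected writing to be slower than reading, but is a difference this large normal? Why?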

More info: the byte arrays I read are 2 MB each (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.

Answer 1

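You are reading from local disk and writing to HDFS. When you write to HDFS, your data is probably being replicated, so depending on what you have set for the replication factor, it is physically written two to three times.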

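So you are not only writing, you are writing two to three times the amount of data you are reading. And your writes go over the network; your reads do not.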
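
As a rough back-of-the-envelope illustration of this point, the sketch below computes how many bytes are physically written for the sizes given in the question (nx = ny = 1024, nz = 128, 2 bytes per sample) under different replication factors. The class and method names are hypothetical and not part of Hadoop, and the calculation ignores SequenceFile headers, sync markers, and any compression.

```java
// Hypothetical helper, not part of Hadoop: estimates raw bytes written to disk.
final class ReplicationCost {
    // Payload size: nz records, each holding nx * ny samples of bytesPerSample bytes.
    static long payloadBytes(int nx, int ny, int nz, int bytesPerSample) {
        return (long) nx * ny * bytesPerSample * nz;
    }

    // Every block is stored once per replica, so physical writes scale linearly.
    static long physicalBytesWritten(long payloadBytes, int replication) {
        return payloadBytes * replication;
    }

    public static void main(String[] args) {
        long payload = payloadBytes(1024, 1024, 128, 2);
        for (int replication = 1; replication <= 3; replication++) {
            System.out.printf("replication=%d -> %d MiB written%n",
                    replication, physicalBytesWritten(payload, replication) >> 20);
        }
    }
}
```

With these numbers the payload is 256 MiB, so a replication factor of 3 means roughly 768 MiB of physical writes. If the replication factor is 1, which is common in single-node setups, this multiplier disappears and any remaining gap has to come from somewhere else.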

Answer 2

Are nx and ny constants?

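One reason you could be seeing this is that each iteration of your for loop creates a new byte array. This requires the JVM to allocate heap space for it. If the array is large enough, this gets expensive, and eventually you will run into GC. I'm not sure what HotSpot might do to optimize this away.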

My suggestion would be to create a single BytesWritable and reuse it:

// use DataInputStream so you can call readFully()
DataInputStream in = new DataInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory),conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/"+ "data.seq");

IntWritable key = new IntWritable();
// create a BytesWritable, which can hold the maximum possible number of bytes
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]);
// grab a reference to the value's underlying byte array
byte byteBuf[] = value.getBytes();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
  // work out how many bytes to read - if this is a constant, move outside the for loop
  int imageDataSize = nx * ny * 2;
  // read in bytes to the byte array
  in.readFully(byteBuf, 0, imageDataSize);

  key.set(i);
  // set the actual number of bytes used in the BytesWritable object
  value.setSize(imageDataSize);
  writer.append(key, value);
}

IOUtils.closeStream(writer);
in.close();
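
A side note on the readFully() call in the snippet above: InputStream.read(byte[]) is allowed to return fewer bytes than requested, whereas DataInputStream.readFully() keeps reading until the requested range is filled or throws EOFException. The sketch below shows the difference using only the JDK. ChunkedStream is a hypothetical stream made up for this illustration; it hands out at most 3 bytes per call to mimic a source that delivers data in small pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadFullyDemo {
    // Hypothetical stream that returns at most `chunk` bytes per read call.
    static final class ChunkedStream extends FilterInputStream {
        private final int chunk;

        ChunkedStream(InputStream in, int chunk) {
            super(in);
            this.chunk = chunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, chunk));
        }
    }

    // A single read() call: may fill only part of the buffer.
    static int plainRead(byte[] source, byte[] buf) {
        try (InputStream in = new ChunkedStream(new ByteArrayInputStream(source), 3)) {
            return in.read(buf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // readFully() loops internally until the whole buffer has been filled.
    static void fullRead(byte[] source, byte[] buf) {
        try (DataInputStream in = new DataInputStream(
                new ChunkedStream(new ByteArrayInputStream(source), 3))) {
            in.readFully(buf, 0, buf.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        byte[] buf = new byte[8];
        System.out.println("read() returned " + plainRead(source, buf)); // 3
        fullRead(source, buf);
        System.out.println("last byte after readFully(): " + buf[7]);    // 8
    }
}
```

When a single buffer is reused across iterations, a short read would leave stale bytes from the previous record in the tail of the buffer, which is why filling it completely on every iteration matters.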

Why is writing a Hadoop SequenceFile so much slower than reading it?
