Extra EFBFBD bytes when reading through Hadoop thriftfs

Asked: 2014-04-22 10:10:12

Tags: java python hadoop character-encoding thrift

In hadoop-0.20 there is a thriftfs contrib module that lets us access HDFS from other programming languages. Hadoop ships an hdfs.py script as a demo. The problem lies in its do_get and do_put methods.

If we use get to download a UTF-8 text file, everything works, but when we get a file in any other encoding we cannot recover the original: the downloaded file contains many extra "EFBFBD" byte sequences. I suspect this Java code in HadoopThriftServer is the cause:

public String read(ThriftHandle tout, long offset,
                   int length) throws ThriftIOException {
  try {
    now = now();
    HadoopThriftHandler.LOG.debug("read: " + tout.id +
                                  " offset: " + offset +
                                  " length: " + length);
    FSDataInputStream in = (FSDataInputStream)lookup(tout.id);
    if (in.getPos() != offset) {
      in.seek(offset);
    }
    byte[] tmp = new byte[length];
    int numbytes = in.read(offset, tmp, 0, length);
    HadoopThriftHandler.LOG.debug("read done: " + tout.id);
    // Lossy for non-UTF-8 data: invalid byte sequences are
    // replaced with U+FFFD, which serializes as EF BF BD.
    return new String(tmp, 0, numbytes, "UTF-8");
  } catch (IOException e) {
    throw new ThriftIOException(e.getMessage());
  }
}
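That suspicion matches the symptom exactly: decoding arbitrary bytes as UTF-8 with replacement turns every invalid sequence into U+FFFD, and U+FFFD re-encodes as the three bytes EF BF BD. A minimal Python sketch (the GBK bytes here are a hypothetical example, not taken from the actual file) reproduces the effect:

```python
# new String(tmp, 0, numbytes, "UTF-8") in Java behaves like
# decoding with errors='replace': every invalid byte sequence
# becomes U+FFFD, which encodes back to UTF-8 as EF BF BD.
raw = b'\xd6\xd0'   # "中" encoded in GBK -- not valid UTF-8
text = raw.decode('utf-8', errors='replace')  # two U+FFFD chars
back = text.encode('utf-8')
print(back.hex())   # efbfbdefbfbd
```

So the original bytes are destroyed on the server before Thrift ever serializes them, which is why the downloaded file is littered with EF BF BD.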
The Python code in hdfs.py is:

output = open(local, 'wb')
path = Pathname()
path.pathname = hdfs
input = self.client.open(path)

# find size of hdfs file
filesize = self.client.stat(path).length

# read 1MB at a time from hdfs
offset = 0
chunksize = 1024 * 1024
while True:
    chunk = self.client.read(input, offset, chunksize)
    if not chunk: break
    output.write(chunk)
    offset += chunksize
    if offset >= filesize: break

self.client.close(input)
output.close()

I hope someone can help. Thanks.

0 Answers:

No answers yet