How do I read a binary file from HDFS?

Time: 2017-05-26 00:28:53

Tags: hadoop apache-spark hdfs

I am currently writing a parser for shapefiles using Spark. I use NewAPIHadoopFile to extract binary records one by one from the raw .shp file. The problem is that the program works when it reads the file from the local disk, but when it reads the file from HDFS, the byte stream I get from DataInputStream is no longer intact, i.e. it no longer matches the original file. The exception is as follows:

java.lang.NegativeArraySizeException
    at ShapeFileParse.ShapeParseUtil.parseRecordPrimitiveContent(ShapeParseUtil.java:53)
    at spatial.ShapeFileReader.nextKeyValue(ShapeFileReader.java:54)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
17/05/25 17:19:15 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NegativeArraySizeException
    at ShapeFileParse.ShapeParseUtil.parseRecordPrimitiveContent(ShapeParseUtil.java:53)
    at spatial.ShapeFileReader.nextKeyValue(ShapeFileReader.java:54)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:182)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1702)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

My code in the RecordReader is as follows:

public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        ShapeParseUtil.initializeGeometryFactory();
        FileSplit fileSplit = (FileSplit)split;
        long start = fileSplit.getStart();
        long end = start + fileSplit.getLength();
        int len = (int)fileSplit.getLength();
        Path filePath = fileSplit.getPath();
        FileSystem fileSys = filePath.getFileSystem(context.getConfiguration());
        FSDataInputStream inputStreamFS = fileSys.open(filePath);
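        // The commented-out lines below are the buffered workaround mentioned further down:
        // reading the whole split into memory with IOUtils.readFully and parsing from a
        // ByteArrayInputStream worked, while reading the FSDataInputStream directly did not.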
        //byte[] wholeStream = new byte[len];
        inputStream = new DataInputStream(inputStreamFS);
        //IOUtils.readFully(inputStream, wholeStream, 0, len);
        //inputStream = new DataInputStream(new ByteArrayInputStream(wholeStream));
        ShapeParseUtil.parseShapeFileHead(inputStream);
    }

Here is how I set up my FileInputFormat:
public class ShapeInputFormat extends FileInputFormat<ShapeKey, BytesWritable> {
    public RecordReader<ShapeKey, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        return new ShapeFileReader(split, context);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
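For context, the driver-side wiring for a custom input format like this usually looks something like the sketch below. This is only an assumed, minimal example: the HDFS path is a placeholder, ShapeKey and ShapeInputFormat are the classes from the question, and the RDD is simply counted to force the RecordReader to run (which is where the stack trace above was produced).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ShapeCountDriver {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ShapeFileParse");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the .shp file through the custom, non-splittable input format.
        // The path below is a placeholder, not one from the original question.
        JavaPairRDD<ShapeKey, BytesWritable> records = sc.newAPIHadoopFile(
                "hdfs:///data/sample.shp",
                ShapeInputFormat.class,
                ShapeKey.class,
                BytesWritable.class,
                new Configuration());

        // An action such as count() triggers nextKeyValue() on the executors.
        System.out.println("record count = " + records.count());
        sc.stop();
    }
}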

When I tested with IOUtils.readFully, the whole byte array I got was fine; it is only when reading through the DataInputStream directly that things go wrong. So I believe the file on HDFS itself is still intact. Since my file is only 560 KB and there is only this one file on HDFS right now, I don't think it would have been split across multiple blocks.

I am new to this, so this may be a naive question; many thanks to anyone who can teach me about it.

1 Answer:

Answer 0 (score: 0):

I solved it myself by changing read() to readFully() when reading each record's byte array. I am still a bit confused, because available() returns exactly the size of the split, yet when I call inputStream.read() it does not read the full length I specify.
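A note on why readFully() fixes it (my own explanation, not from the original answer): InputStream.read(byte[], off, len) is only required to return at least one byte, and over HDFS an FSDataInputStream frequently returns short reads at internal buffer boundaries, so the stream drifts out of alignment and a later length field can be read as garbage (hence the NegativeArraySizeException). DataInputStream.readFully(byte[], off, len) keeps reading until the buffer is completely filled, or throws EOFException. A minimal sketch of a per-record read, assuming the content length in 16-bit words has already been taken from the record header:

import java.io.DataInputStream;
import java.io.IOException;

public class RecordReadSketch {
    // Reads the content of one record. contentLengthWords is assumed to come
    // from the record header (shapefile headers give lengths in 16-bit words).
    static byte[] readRecordContent(DataInputStream in, int contentLengthWords) throws IOException {
        int contentBytes = contentLengthWords * 2;
        byte[] buf = new byte[contentBytes];

        // Fragile over HDFS: read() may return fewer than contentBytes bytes,
        // leaving buf partly unfilled and the stream misaligned for the next
        // record header.
        // in.read(buf, 0, contentBytes);

        // Robust: readFully() loops until buf is full, or throws EOFException
        // if the stream ends first.
        in.readFully(buf, 0, contentBytes);
        return buf;
    }
}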