Apache Beam - unable to use the hadoop-file-system SDK

Time: 2017-06-28 02:53:48

Tags: java hadoop amazon-s3 apache-beam apache-beam-io

I am trying to read from S3 in a Beam application running on an AWS EMR cluster, using beam-sdks-java-io-hadoop-file-system v2.0.0 with Spark as the runner. In the YARN logs I can see that the pipeline detects the files present in S3, but it is unable to read them. See the logs below.

17/06/27 03:29:25 INFO FileBasedSource: Filepattern s3a://xxx/test-folder/* matched 1 files with total size 3410584
17/06/27 03:29:25 INFO FileBasedSource: Matched 1 files for pattern s3a://xxx/test-folder/*
17/06/27 03:29:25 INFO FileBasedSource: Splitting filepattern s3a://xxx/test-folder/* into bundles of size 1705292 took 82 ms and produced 1 files and 1 bundles
17/06/27 03:29:25 INFO SparkContext: Starting job: foreach at BoundedDataset.java:109
  

17/06/27 03:29:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-130-237-237.vpc.internal:40063 (size: 4.6 KB, free: 3.5 GB)
17/06/27 03:29:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-130-237-237.vpc.internal): java.lang.RuntimeException: Failed to read data.
        at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:198)
        at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.hasNext(SourceRDD.java:239)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: Byte-buffer read unsupported by input stream
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146)
        at org.apache.beam.sdk.io.hdfs.HadoopFileSystem$HadoopSeekableByteChannel.read(HadoopFileSystem.java:185)
        at org.apache.beam.sdk.io.CompressedSource$CompressedReader$CountingChannel.read(CompressedSource.java:509)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
        at org.apache.beam.sdk.repackaged.com.google.common.io.ByteStreams.read(ByteStreams.java:859)
        at org.apache.beam.sdk.io.CompressedSource$CompressionMode$1.createDecompressingChannel(CompressedSource.java:127)
        at org.apache.beam.sdk.io.CompressedSource$DecompressAccordingToFilename.createDecompressingChannel(CompressedSource.java:258)
        at org.apache.beam.sdk.io.CompressedSource$CompressedReader.startReading(CompressedSource.java:542)
        at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:471)
        at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
        at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.seekNext(SourceRDD.java:214)
        at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:188)
        ... 22 more

When I run the same code with HDFS as the input file system, it works perfectly. Can someone help me figure out how to read data from S3? The input format is text files with gzip compression.

Code:

HadoopFileSystemOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(HadoopFileSystemOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.default.name")))
 .apply(ParDo.of(new PrintLineTransform()));

Running with S3:

--hdfsConfiguration='[{"fs.default.name": "s3a://xxx/test-folder/*"}]'

Running with HDFS:

--hdfsConfiguration='[{"fs.default.name": "hdfs://xxx/test-folder/*"}]'

1 Answer:

Answer 0: (score: 1)

The log says Byte-buffer read unsupported by input stream.

It looks like Beam requires an API extension of the input stream (a ByteBuffer read) that is not actually implemented by the S3A client, hence the failure. The JIRA for adding it, HADOOP-14603, was only filed recently, and the feature is not even in an unreleased version of Hadoop yet.
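
To make the failure concrete, here is a hedged sketch of the contract involved: FSDataInputStream.read(ByteBuffer) only works when the wrapped stream implements ByteBufferReadable (HDFS does); when it doesn't, as with the S3A stream at this Hadoop version, it throws the UnsupportedOperationException seen in the stack trace, and callers have to fall back to read(byte[]). The bucket and object name below are placeholders.

import java.net.URI;
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ByteBufferReadable;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ByteBufferReadCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("s3a://xxx/"), new Configuration());
        try (FSDataInputStream in = fs.open(new Path("s3a://xxx/test-folder/sample.gz"))) {
            ByteBuffer buffer = ByteBuffer.allocate(8192);
            if (in.getWrappedStream() instanceof ByteBufferReadable) {
                in.read(buffer);                             // zero-copy path, supported by HDFS
            } else {
                byte[] bytes = new byte[buffer.remaining()]; // fallback path needed for S3A
                int n = in.read(bytes, 0, bytes.length);
                if (n > 0) {
                    buffer.put(bytes, 0, n);
                }
            }
        }
    }
}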

Nobody is going to rush to implement this, because (a) there are more important things to do and (b) it is a minor performance optimization (memory efficiency) rather than something critical for real use. You don't get it on the local file system or on Azure either.

Fixes:

  1. Implement the feature in Hadoop, with tests. I promise to review it for you.
  2. Convince someone else to do it for you. For those of us who develop Hadoop professionally, that usually happens when someone asks for it, be it management or a customer.
  3. Get Beam to handle file systems that don't support this feature and throw UnsupportedOperationException by falling back to read(byte[]). That's all they should need to do (see the sketch at the end of this answer).
  4. Workarounds:

    • Store your data in a different compression format
    • Copy it to HDFS first.

    I suggest you first search the Apache JIRA for this bug/stack trace under the BEAM project, then file a new report if one doesn't exist. See what they say.

    Update: keep an eye on BEAM-2500.
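
As a rough illustration of fix 3 above: try the ByteBuffer read first and drop back to read(byte[]) when the stream throws UnsupportedOperationException. The class name is invented and only loosely mirrors what a channel wrapper inside Beam would look like; this is not the actual BEAM-2500 patch.

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical helper: prefer the zero-copy ByteBuffer read, fall back to read(byte[])
// for file systems (S3A, local, Azure) whose streams do not implement ByteBufferReadable.
class FallbackChannelReader {
    private final FSDataInputStream in;

    FallbackChannelReader(FSDataInputStream in) {
        this.in = in;
    }

    int read(ByteBuffer dst) throws IOException {
        try {
            return in.read(dst);                    // fast path: HDFS
        } catch (UnsupportedOperationException e) {
            byte[] tmp = new byte[dst.remaining()]; // generic path: any InputStream
            int n = in.read(tmp, 0, tmp.length);
            if (n > 0) {
                dst.put(tmp, 0, n);
            }
            return n;
        }
    }
}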