I am trying to read from S3 in a Beam application on an AWS EMR cluster, using beam-sdks-java-io-hadoop-file-system v2.0.0 with Spark as the runner. I can see in the YARN logs that the pipeline detects the files present in S3, but it fails to read them. See the logs below.
17/06/27 03:29:25 INFO FileBasedSource: Filepattern s3a://xxx/test-folder/* matched 1 files with total size 3410584
17/06/27 03:29:25 INFO FileBasedSource: Matched 1 files for pattern s3a://xxx/test-folder/*
17/06/27 03:29:25 INFO FileBasedSource: Splitting filepattern s3a://xxx/test-folder/* into bundles of size 1705292 took 82 ms and produced 1 files and 1 bundles
17/06/27 03:29:25 INFO SparkContext: Starting job: foreach at BoundedDataset.java:109
17/06/27 03:29:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-130-237-237.vpc.internal:40063 (size: 4.6 KB, free: 3.5 GB)
17/06/27 03:29:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-130-237-237.vpc.internal): java.lang.RuntimeException: Failed to read data.
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:198)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.hasNext(SourceRDD.java:239)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: Byte-buffer read unsupported by input stream
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:146)
    at org.apache.beam.sdk.io.hdfs.HadoopFileSystem$HadoopSeekableByteChannel.read(HadoopFileSystem.java:185)
    at org.apache.beam.sdk.io.CompressedSource$CompressedReader$CountingChannel.read(CompressedSource.java:509)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
    at org.apache.beam.sdk.repackaged.com.google.common.io.ByteStreams.read(ByteStreams.java:859)
    at org.apache.beam.sdk.io.CompressedSource$CompressionMode$1.createDecompressingChannel(CompressedSource.java:127)
    at org.apache.beam.sdk.io.CompressedSource$DecompressAccordingToFilename.createDecompressingChannel(CompressedSource.java:258)
    at org.apache.beam.sdk.io.CompressedSource$CompressedReader.startReading(CompressedSource.java:542)
    at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:471)
    at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.seekNext(SourceRDD.java:214)
    at org.apache.beam.runners.spark.io.SourceRDD$Bounded$ReaderToIteratorAdapter.tryProduceNext(SourceRDD.java:188)
    ... 22 more
When I run the same code with HDFS as the input file system, it works perfectly. Can someone help me figure out how to read data from S3? The input is gzip-compressed text files.
Code:
HadoopFileSystemOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(HadoopFileSystemOptions.class);
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getHdfsConfiguration().get(0).get("fs.default.name")))
.apply(ParDo.of(new PrintLineTransform()));
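PrintLineTransform is not defined in the question. Purely as an illustration (an assumption, not the asker's actual code), a minimal version of such a DoFn might look like this:

import org.apache.beam.sdk.transforms.DoFn;

// Sketch only (assumption): a DoFn that prints each line it receives.
class PrintLineTransform extends DoFn<String, Void> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        System.out.println(c.element());
    }
}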
Running with S3:
--hdfsConfiguration='[{"fs.default.name": "s3a://xxx/test-folder/*"}]'
Running with HDFS:
--hdfsConfiguration='[{"fs.default.name": "hdfs://xxx/test-folder/*"}]'
Answer 0 (score: 1)
The log says: Byte-buffer read unsupported by input stream
It looks like Beam expects an API extension on the input stream that is not actually implemented by the S3A client, so the read fails. There is a recently filed JIRA for it, HADOOP-14603, but the feature is not in even an unreleased Hadoop version yet.
Nobody is going to rush to implement that feature, because (a) there are more important things to do and (b) it is a small performance optimization (memory efficiency), not something that matters for actual use. You cannot get it on the local file system or on Azure either.
Fix: have Beam catch the UnsupportedOperationException and fall back to read(byte[]). That is all they have to do.
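To make that concrete, here is a minimal sketch (an illustration only, not Beam's actual HadoopSeekableByteChannel code; the class and method names are made up) of a read over Hadoop's FSDataInputStream that tries the ByteBuffer API first and falls back to read(byte[]) when the wrapped stream does not support it:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative sketch only -- not the actual Beam code.
class ByteBufferFallbackReader {
    private final FSDataInputStream in;

    ByteBufferFallbackReader(FSDataInputStream in) {
        this.in = in;
    }

    // Reads into dst, preferring the ByteBuffer API, falling back to read(byte[]).
    int read(ByteBuffer dst) throws IOException {
        try {
            // Works only if the wrapped stream implements ByteBufferReadable
            // (HDFS does; the S3A stream in this Hadoop version does not).
            return in.read(dst);
        } catch (UnsupportedOperationException e) {
            // Fallback: copy through a temporary byte[] instead.
            byte[] buf = new byte[dst.remaining()];
            int n = in.read(buf, 0, buf.length);
            if (n > 0) {
                dst.put(buf, 0, n);
            }
            return n;
        }
    }
}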
Workaround: I would suggest you first search Apache's JIRA under the BEAM project for this bug/stack trace, and file a new report if one does not exist. See what they say.
Update: keep an eye on BEAM-2500.