Azure Blob support in Apache Beam?

Date: 2016-12-29 20:26:31

Tags: java apache-spark azure-storage azure-storage-blobs apache-beam

I would like to know whether Apache Beam supports Windows Azure Storage Blob (WASB) file IO. Is such support available yet?

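I ask because I have deployed an Apache Beam application to run a job on an Azure Spark cluster, and it turns out to be essentially impossible to read or write WASB files in the storage container associated with that Spark cluster. Is there an alternative?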

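Context: I am trying to run the WordCount example on my Azure Spark cluster. I have already set up some components as described here, believing that would help. Below is the part of my code where I set the Hadoop configuration: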

// Configure the pipeline to run on Spark through YARN.
final SparkPipelineOptions options = PipelineOptionsFactory.create().as(SparkPipelineOptions.class);

options.setAppName("WordCountExample");
options.setRunner(SparkRunner.class);
options.setSparkMaster("yarn");
// Create the Spark context up front so that its Hadoop configuration can be edited.
JavaSparkContext context = new JavaSparkContext();
Configuration conf = context.hadoopConfiguration();
// Point Hadoop at the hadoop-azure file system class and supply the storage account key.
conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
conf.set("fs.azure.account.key.<storage-account>.blob.core.windows.net",
         "<key>");
// Hand the preconfigured context to Beam.
options.setProvidedSparkContext(context);
Pipeline pipeline = Pipeline.create(options);

Unfortunately, I keep ending up with the following error:

java.lang.IllegalStateException: Failed to validate wasb://<storage-container>@<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:288)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:195)
at org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
at org.apache.beam.runners.spark.SparkRunner.apply(SparkRunner.java:129)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:400)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:323)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:58)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:173)
at spark.example.WordCount.main(WordCount.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.io.IOException: Unable to find handler for  wasb://<storage-container>@<storage-account>.blob.core.windows.net/user/spark/kinglear.txt
at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:187)
at org.apache.beam.sdk.io.TextIO$Read$Bound.apply(TextIO.java:283)
... 13 more

I am considering implementing a custom IO for Azure Storage Blob in Apache Beam. I would like to check with the community whether that is a viable alternative.

1 Answer:

Answer 0: (score: 2)

Apache Beam does not currently have a built-in connector for Windows Azure Storage Blob (WASB).
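
The "Unable to find handler" line in the question's stack trace is consistent with this: the path is resolved to an IO handler by its URI scheme, and nothing is registered for `wasb`. The following is a simplified, self-contained sketch of that kind of lookup. It is not Beam's actual code; the class `SchemeRegistry`, its methods, and the sample container and account names are hypothetical and chosen only for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of scheme-based handler lookup; this is not Beam's code.
final class SchemeRegistry {
    // Maps a URI scheme such as "file" to the name of the handler registered for it.
    private final Map<String, String> handlers = new HashMap<>();

    void register(String scheme, String handlerName) {
        handlers.put(scheme, handlerName);
    }

    // Returns the handler for the spec's scheme, or empty if none is registered.
    // Specs without a scheme are treated as local files (an assumption of this sketch).
    Optional<String> find(String spec) {
        String scheme = URI.create(spec).getScheme();
        return Optional.ofNullable(handlers.get(scheme == null ? "file" : scheme));
    }
}
```

With only a `file` handler registered, `find("wasb://mycontainer@myaccount.blob.core.windows.net/user/spark/kinglear.txt")` returns an empty result, which is the situation the exception reports. Setting Hadoop properties on the Spark context does not change this, because the lookup fails before any Hadoop file system is consulted.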

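The Apache Beam project is actively working on adding support for HadoopFileSystem. I believe WASB has a connector for HadoopFileSystem in the hadoop-azure module. That would make WASB usable with Beam indirectly. This is probably the easiest path forward, and it should be ready very soon.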
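
Going through hadoop-azure means following that module's conventions: a path has the form `wasb://<container>@<account>.blob.core.windows.net/<path>`, and the account key is configured under `fs.azure.account.key.<account>.blob.core.windows.net`, as in the question's configuration. A minimal sketch of how the two relate, using only `java.net.URI`; the class `WasbPath` and the sample names are hypothetical and not part of Beam or Hadoop.

```java
import java.net.URI;

// Hypothetical helper that splits a wasb:// URI into its parts; not a Beam or Hadoop class.
final class WasbPath {
    final String container;   // the part before '@'
    final String accountHost; // e.g. myaccount.blob.core.windows.net
    final String path;        // blob path inside the container

    private WasbPath(String container, String accountHost, String path) {
        this.container = container;
        this.accountHost = accountHost;
        this.path = path;
    }

    static WasbPath parse(String spec) {
        URI uri = URI.create(spec);
        if (!"wasb".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not a wasb URI: " + spec);
        }
        if (uri.getUserInfo() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Expected wasb://<container>@<account-host>/<path>: " + spec);
        }
        return new WasbPath(uri.getUserInfo(), uri.getHost(), uri.getPath());
    }

    // Name of the Hadoop property that holds this storage account's key.
    String accountKeyProperty() {
        return "fs.azure.account.key." + accountHost;
    }
}
```

For example, parsing `wasb://mycontainer@myaccount.blob.core.windows.net/user/spark/kinglear.txt` gives the container `mycontainer` and the property name `fs.azure.account.key.myaccount.blob.core.windows.net`.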

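That said, it would be great to have native support for WASB in Beam. It would likely enable another level of performance, and it should be relatively straightforward to implement. As far as I know, nobody is actively working on it, but it would be an awesome contribution to the project! (If you are personally interested in contributing, please reach out!)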