Question

I want to load images from HDFS into a Spark RDD,
and then process those images with Spark.

I tested:

JavaPairRDD<String, String> pairRdd = jsc.wholeTextFiles("hdfs://cluster-1-m/user/username/images/");

to load images from HDFS to Spark's RDD.

Then, when I call the imread method to read an image:

Mat image = imread(value._1()); // value is the Tuple2<String, String> coming from pairRdd

I find that the image is null!

I am using:

Answer 1

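The image is null because value._1() is an HDFS path, not the local file that JavaCV expects. There are no "whole files" in the HDFS sense, because files are broken up and distributed as blocks.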

You need to download the file from HDFS first, before you can process it locally with JavaCV.

You could do that with the native Hadoop API instead of Spark.

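Or you could try streaming the contents of value._2() to File objects. (Actually, you probably want the binaryFiles(path) method instead for anything that isn't text, such as images; wholeTextFiles decodes each file as text, which corrupts binary data.)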
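
The "write the contents to a File" idea can be sketched as below. This is only a rough illustration that uses the Java standard library and nothing else: the class and method names (ImageTempFiles, writeToTempFile, extensionOf) are made up for this example, and the byte array stands in for real image data. In a Spark job the bytes would presumably come from calling toArray() on the PortableDataStream values returned by jsc.binaryFiles(path); that wiring is assumed, not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: copies raw image bytes into a local temp file so that
// a path-based reader (such as imread) has a real local file to open.
final class ImageTempFiles {
    private ImageTempFiles() {}

    // Returns the extension of the last path segment, including the dot,
    // or ".bin" when the name has no usable extension.
    static String extensionOf(String sourcePath) {
        String name = sourcePath.substring(sourcePath.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        return (dot > 0) ? name.substring(dot) : ".bin";
    }

    // Writes the bytes to a new temp file and returns its local path.
    // sourcePath is only used to pick a matching file extension.
    static Path writeToTempFile(String sourcePath, byte[] bytes) {
        try {
            Path tmp = Files.createTempFile("img-", extensionOf(sourcePath));
            Files.write(tmp, bytes);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes; real code would pass the file contents read from HDFS.
        byte[] fake = {1, 2, 3, 4};
        Path local = writeToTempFile("hdfs://cluster-1-m/user/username/images/cat.png", fake);
        System.out.println("wrote " + local.toFile().length() + " bytes to " + local);
        local.toFile().delete(); // clean up once the image has been processed
    }
}
```

The returned path's toString() is what you would hand to imread. The temp file lives on whichever machine runs the code, so inside an RDD operation it is created on the executor, and it should be deleted after the image has been read.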

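In other words, you aren't really using Spark here, other than to scan an HDFS directory. An alternative is to package JavaCV into your Spark code via a JAR file and call it inside an RDD map(); you would then still need to get each image onto the Spark executors, as mentioned above.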