pyspark ImageSchema.toNDArray raises AttributeError: 'NoneType' object has no attribute '_jvm'

Time: 2018-06-22 06:41:03

Tags: pyspark apache-spark-mllib

I have run into a problem with the new pyspark.ml.image functionality in Spark 2.3.

Using ImageSchema.toNDArray() in a "local computation" works fine, but using it inside rdd.map() raises:

AttributeError: 'NoneType' object has no attribute '_jvm'

You can try the following code in pyspark, with pictures prepared in a folder "jpg". For example, I put this single picture into it.

In a "local computation" it is fine:

>>> from pyspark.ml.image import ImageSchema
>>> df = ImageSchema.readImages("jpg")
>>> row = df.collect()[0]               # collect() to a "local" list and take the first
>>> ImageSchema.toNDArray(row.image)    # so this toNDArray() is a "local computation"
array([[[228, 141,  97],
        [229, 142,  98],
        [229, 142,  98],
        ...,
        [239, 157, 110],
        [239, 157, 110],
        [239, 157, 109]],
        ...    
        ...
       [[ 66,  38,  21],
        [ 66,  38,  21],
        [ 66,  38,  21],
        ...,
        [ 91,  55,  37],
        [ 94,  57,  37],
        [ 94,  57,  37]]], dtype=uint8)

But if I put it inside rdd.map(), it raises:

AttributeError: 'NoneType' object has no attribute '_jvm'

>>> from pyspark.ml.image import ImageSchema
>>> df = ImageSchema.readImages("jpg")
>>> df.rdd.map(lambda row: ImageSchema.toNDArray(row.image)).take(1)

...
...

  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/pyspark.zip/pyspark/ml/image.py", line 123, in toNDArray
    if any(not hasattr(image, f) for f in self.imageFields):
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib/pyspark.zip/pyspark/ml/image.py", line 90, in imageFields
    if self._imageFields is None:
        ctx = SparkContext._active_spark_context
        self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
AttributeError: 'NoneType' object has no attribute '_jvm'

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        ...
        ...

This has been tested and is reproducible on:

Spark 2.3.0 provided by Cloudera parcel
Spark 2.3.0 on Hortonworks
Spark 2.3.0 on Windows with WinUtils
Spark 2.3.1 on Windows with WinUtils

What is wrong?

How can I fix it?

1 Answer:

Answer 0 (score: 0):

I think this is a bug in pyspark.ml.image. Inside rdd.map() the lambda runs in executor processes, where SparkContext._active_spark_context is None because a SparkContext only exists on the driver, so ctx._jvm fails. If you change every such line in .../lib/spark2/python/pyspark/ml/image.py

from

ctx = SparkContext._active_spark_context

to

ctx = SparkContext.getOrCreate()

then everything works fine.
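For context, here is a sketch of the patched imageFields property, reconstructed from the traceback above (the decorator and return line are my completion, not shown in the traceback):

# in pyspark/ml/image.py (sketch, not the complete file)
@property
def imageFields(self):
    if self._imageFields is None:
        ctx = SparkContext.getOrCreate()  # was: SparkContext._active_spark_context
        self._imageFields = list(ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageFields())
    return self._imageFields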

However, I am not an expert on pyspark, so I think it would be better to leave this open for discussion before an answer is accepted.

P.S. I am not saying it should be fixed this way; I just think it may be a bug.
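By the way, a possible workaround that avoids patching the library is to reshape the image bytes with plain numpy inside the map, so the executors never need JVM access. This is a minimal sketch assuming the Spark 2.3 image struct layout (height, width, nChannels, and data as row-major uint8 bytes); the helper name image_to_ndarray is mine:

import numpy as np
from pyspark.ml.image import ImageSchema

def image_to_ndarray(image):
    # Reshape the raw bytes using the struct's own metadata;
    # this runs entirely in Python, so no SparkContext/JVM is needed.
    return np.ndarray(
        shape=(image.height, image.width, image.nChannels),
        dtype=np.uint8,
        buffer=bytes(image.data),
        strides=(image.width * image.nChannels, image.nChannels, 1))

df = ImageSchema.readImages("jpg")
df.rdd.map(lambda row: image_to_ndarray(row.image)).take(1)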