PySpark is unable to read any data file as a DataFrame

Asked: 2017-06-12 15:51:22

Tags: python apache-spark pyspark apache-spark-sql parquet

For the past few days I have been hitting a strange error that I cannot figure out.

  1. I am using pyspark and trying to load a CSV into a DataFrame [code below], and it throws this error:

     py4j.protocol.Py4JJavaError: An error occurred while calling o10.textFile.
    : java.lang.reflect.InaccessibleObjectException: Unable to make field transient java.lang.Object[] 

  2. I actually wanted to convert the CSV to Parquet in the first place, and even that hits the same error. (It converts to Parquet successfully, but when I try to print the schema or a count of some columns from that table, it fails with the same error as above. A sketch of the conversion step appears just before the Parquet read-back code below.)

  3. The CSV-to-DF code:

        import os
        os.environ['SPARK_HOME'] = "/opt/apps/spark-2.0.1-bin-hadoop2.7/"

        from pyspark import SparkContext
        from pyspark.sql import SQLContext
        from pyspark.sql.types import *

        sc = SparkContext(master='local')
        sqlContext = SQLContext(sc)

        # Read the file as an RDD of lines, split each line on commas,
        # then convert the RDD of lists into a DataFrame
        Employee_rdd = sc.textFile("abc.csv").map(lambda line: line.split(","))
        Employee_df = Employee_rdd.toDF()
        Employee_df.show()
    

    Error stack trace:

    
        File "/home/v/scripts/g_s_pipe/a.py", line 14, in 
            Employee_rdd = sc.textFile("abc.csv").map(lambda line: line.split(","))
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/context.py", line 476, in textFile
            return RDD(self._jsc.textFile(name, minPartitions), self,
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
            return f(*a, **kw)
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
        py4j.protocol.Py4JJavaError: An error occurred while calling o10.textFile.
        : java.lang.reflect.InaccessibleObjectException: Unable to make field transient java.lang.Object[] java.util.ArrayList.elementData accessible: module java.base does not "opens java.util" to unnamed module @51b63e70
            at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:335)
            at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:278)
            at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:175)
            at java.base/java.lang.reflect.Field.setAccessible(Field.java:169)
    
    

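    For reference, the conversion step described in item 2 is not shown above. A minimal sketch of what it presumably looks like, assuming the same file names (`abc.csv` in, `tract_alpha.parquet` out) and that the CSV has a header row:

        import os
        os.environ['SPARK_HOME'] = "/opt/apps/spark-2.0.1-bin-hadoop2.7/"

        from pyspark import SparkContext
        from pyspark.sql import SQLContext

        sc = SparkContext(master='local')
        sqlContext = SQLContext(sc)

        # Read the CSV directly as a DataFrame (the csv reader is built into Spark 2.0+);
        # header=True and inferSchema=True are assumptions about the input file
        csv_df = sqlContext.read.csv("abc.csv", header=True, inferSchema=True)

        # Write it out as Parquet -- per item 2, this step reportedly succeeds
        csv_df.write.parquet("tract_alpha.parquet")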
    The code that reads the Parquet back:

        import os
        os.environ['SPARK_HOME'] = "/opt/apps/spark-2.0.1-bin-hadoop2.7/"

        from pyspark import SparkContext
        from pyspark.sql import SQLContext

        sc = SparkContext(master='local')
        sqlContext = SQLContext(sc)

        # Read the Parquet file back and count the rows --
        # the count() call is what triggers the error
        df = sqlContext.read.parquet("tract_alpha.parquet")
        print (df.count())
    

    It fails with the same error as well:

    
        17/06/12 21:17:21 WARN BlockManager: Putting block broadcast_1 failed due to an exception
        17/06/12 21:17:21 WARN BlockManager: Block broadcast_1 could not be removed as it was not found on disk or in memory
        Traceback (most recent call last):
          File "/home/vna/scripts/global_score_pipeline/test_code_here.py", line 62, in 
            print (df.count())
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 299, in count
            return int(self._jdf.count())
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
            return f(*a, **kw)
          File "/opt/apps/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
        py4j.protocol.Py4JJavaError: An error occurred while calling o25.count.
        : java.lang.reflect.InaccessibleObjectException: Unable to make field transient java.lang.Object[] java.util.ArrayList.elementData accessible: module java.base does not "opens java.util" to unnamed module @5e37932e
            at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:335)
            at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:278)
            at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:175)
            at java.base/java.lang.reflect.Field.setAccessible(Field.java:169)
            at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:336)
            at org.apache.spark.util.SizeEstimator$$anonfun$getClassInfo$3.apply(SizeEstimator.scala:330)
            at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
            at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
            at org.apache.spark.util.SizeEstimator$.getClassInfo(SizeEstimator.scala:330)
    
    

    What is this InaccessibleObjectException? I could not find any help about it on Google. How do I resolve this?
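
    One diagnostic worth noting: `InaccessibleObjectException` only exists on Java 9 and later (it is thrown by the Java module system), while Spark 2.0.1 predates Java 9. A minimal way to check which JVM the py4j gateway actually launched, using the internal `sc._jvm` handle purely as a debugging aid:

        # Print the Java version of the JVM backing this SparkContext;
        # sc._jvm is a private py4j handle, used here only for diagnosis
        print(sc._jvm.java.lang.System.getProperty("java.version"))

    If this prints a 9+ version, the module system is what blocks the Field.setAccessible call that SizeEstimator attempts in the trace above.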

0 Answers:

No answers yet.