DataFrame describe() on Spark 2.3.1 throws Py4JJavaError

Date: 2018-07-30 20:26:19

Tags: python-3.x apache-spark pyspark jupyter-notebook

I am using Spark 2.3.1 with Python 3.6.5 on Ubuntu. When I run the DataFrame describe() function, I get the following error in Jupyter Notebook.

    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-19-ea8415b8a3ee> in <module>()
    ----> 1 df.describe()

    ~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py in describe(self, *cols)
       1052         if len(cols) == 1 and isinstance(cols[0], list):
       1053             cols = cols[0]
    -> 1054         jdf = self._jdf.describe(self._jseq(cols))
       1055         return DataFrame(jdf, self.sql_ctx)
       1056 

    ~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
       1255         answer = self.gateway_client.send_command(command)
       1256         return_value = get_return_value(
    -> 1257             answer, self.gateway_client, self.target_id, self.name)
       1258 
       1259         for temp_arg in temp_args:

    ~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()

    ~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        326                 raise Py4JJavaError(
        327                     "An error occurred while calling {0}{1}{2}.\n".
    --> 328                     format(target_id, ".", name), value)
        329             else:
        330                 raise Py4JError(

    Py4JJavaError: An error occurred while calling o132.describe.
    : java.lang.IllegalArgumentException
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
        at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
        at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
        at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
        at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
        at org.apache.spark.sql.execution.stat.StatFunctions$.aggResult$lzycompute$1(StatFunctions.scala:273)
        at org.apache.spark.sql.execution.stat.StatFunctions$.org$apache$spark$sql$execution$stat$StatFunctions$$aggResult$1(StatFunctions.scala:273)
        at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$summary$2.apply$mcVI$sp(StatFunctions.scala:286)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
        at org.apache.spark.sql.execution.stat.StatFunctions$.summary(StatFunctions.scala:285)
        at org.apache.spark.sql.Dataset.summary(Dataset.scala:2473)
        at org.apache.spark.sql.Dataset.describe(Dataset.scala:2412)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:564)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:844)

Here is the test code I am using:

    import findspark
    findspark.init('/home/pathirippilly/spark-2.3.1-bin-hadoop2.7')
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructType, StructField, IntegerType
    spark = SparkSession.builder.appName('Basics').getOrCreate()
    df = spark.read.json('people.json')
    df.describe()         # not working
    df.describe().show()  # not working

I have the following versions of Java, Scala, Python, and Spark installed:

    pathirippilly@sparkBox:/usr/lib/jvm$ java -version
    openjdk version "10.0.1" 2018-04-17
    OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
    OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

    pathirippilly@sparkBox:/usr/lib/jvm$ scala -version
    Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Python: 3.6.5

Spark version: spark-2.3.1-bin-hadoop2.7

My environment variables are set as follows. I have saved all of them in /etc/environment, which is sourced from /etc/bash.bashrc:

    JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
    PYSPARK_DRIVER_OPTION="jupyter"
    PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    PYSPARK_PYTHON=python3
    SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7/'
    PATH=$SPARK_HOME:$PATH
    PYTHONPATH=$SPARK_HOME/python/

I have not configured spark_env.sh yet. Do I need to configure spark_env.sh?

Is this caused by a compatibility issue, or am I doing something wrong here?

It would be very helpful if someone could point me in the right direction.

Note: df.show() works perfectly.
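
One way to narrow down whether this is a compatibility issue is to check which Java version the Spark driver JVM is actually running; a small diagnostic sketch (assuming the same notebook session and the spark object created in the code above):

    # Ask the JVM behind the Py4J gateway for its Java version; Spark 2.3.x is
    # only tested against Java 8, so a value like "10.0.1" would suggest the
    # error comes from the Java setup rather than from the DataFrame code.
    print(spark.sparkContext._jvm.java.lang.System.getProperty("java.version"))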

3 Answers:

Answer 0 (score: 3)

This issue is now resolved for me. I reconfigured the entire setup from scratch and prepared the /etc/environment file as follows:

    export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/$
    export SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7'
    export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON='jupyter'
    export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
    export PYSPARK_PYTHONPATH=python3
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
    export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

I have added the following line to /etc/bash.bashrc:

    source /etc/environment

Notes:

  1. pyspark is available on my PYTHONPATH, so every time I open a terminal session, /etc/bash.bashrc sources /etc/environment, which exports all of the environment variables above.

  2. I used java-1.8.0-openjdk-amd64 instead of Java 10 or 11, although I think 10 or 11 might also work with PySpark 2.3.1 according to the documentation. I am not sure.

  3. I used only Scala 2.11.12.

  4. The py4j module is also available on my PYTHONPATH.

I am not sure where I had messed things up before, but with the setup above my PySpark 2.3.1 now runs fine on Java 1.8, Scala 2.11.12, and Python 3.6.5 (and without the findspark module).
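
As a quick sanity check of the setup above (a sketch, assuming people.json sits in the working directory as in the question), pyspark now imports directly, with no findspark call needed:

    from pyspark.sql import SparkSession

    # pyspark is importable because $SPARK_HOME/python is on PYTHONPATH
    spark = SparkSession.builder.appName('Basics').getOrCreate()
    df = spark.read.json('people.json')
    # should now print the count/mean/stddev/min/max rows instead of raising Py4JJavaError
    df.describe().show()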

Answer 1 (score: 1)

OP, my setup was exactly the same as yours; in fact we are following the same Spark course on Udemy (I set up everything exactly as they describe) and I hit the same error in the same place. The only thing I changed was the Java version. When the course was made,

    sudo apt-get install default-jre

installed Java 8, but now it installs Java 11. So I uninstalled that Java, ran

    sudo apt-get install openjdk-8-jre

and then changed the JAVA_HOME path to point to it, and now it works.

Answer 2 (score: 0)

I ran into the same error while taking the same Spark course on Udemy. Below are the steps that resolved it for me.

Remove OpenJDK 11:

1) sudo apt-get autoremove default-jdk openjdk-11-jdk (it will ask for confirmation; confirm it)
2) sudo apt-get remove default-jre

Install JDK 8 and point JAVA_HOME at it:

3) sudo apt-get install openjdk-8-jre
4) export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"

With the above steps, the error from df.describe() was resolved.
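
As an illustrative sketch (assuming the JAVA_HOME export above has been applied to the current shell), a small pre-flight check can confirm that JAVA_HOME really points at the Java 8 install before a new SparkSession is created:

    import os
    import subprocess

    # JAVA_HOME as exported above; expected to end in java-1.8.0-openjdk-amd64
    java_home = os.environ.get("JAVA_HOME", "")
    print("JAVA_HOME =", java_home)

    # `java -version` writes its output to stderr
    result = subprocess.run(
        [os.path.join(java_home, "bin", "java"), "-version"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    print(result.stderr.strip())  # expect: openjdk version "1.8.0_..."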