I am using Spark 2.3.1 with Python 3.6.5 on Ubuntu. When I run the DataFrame describe() function, I get the following error in Jupyter Notebook.
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-19-ea8415b8a3ee> in <module>()
----> 1 df.describe()
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/dataframe.py in describe(self, *cols)
1052 if len(cols) == 1 and isinstance(cols[0], list):
1053 cols = cols[0]
-> 1054 jdf = self._jdf.describe(self._jseq(cols))
1055 return DataFrame(jdf, self.sql_ctx)
1056
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o132.describe.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2073)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.sql.execution.stat.StatFunctions$.aggResult$lzycompute$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$.org$apache$spark$sql$execution$stat$StatFunctions$$aggResult$1(StatFunctions.scala:273)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$summary$2.apply$mcVI$sp(StatFunctions.scala:286)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.sql.execution.stat.StatFunctions$.summary(StatFunctions.scala:285)
at org.apache.spark.sql.Dataset.summary(Dataset.scala:2473)
at org.apache.spark.sql.Dataset.describe(Dataset.scala:2412)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:844)
Here is the test code I am using:
import findspark
findspark.init('/home/pathirippilly/spark-2.3.1-bin-hadoop2.7')
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType,StructType,StructField,IntegerType
spark=SparkSession.builder.appName('Basics').getOrCreate()
df=spark.read.json('people.json')
df.describe() #not working
df.describe().show #not working
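Since people.json is not shown here, the same call can be reproduced with a small inline DataFrame; the rows and column names below are made up for illustration only. Note that, as the traceback shows, describe() itself already runs a Spark job in 2.3.x (Dataset.describe calls Dataset.summary, which collects aggregation results), so the error appears even without .show(); also, .show without parentheses only references the method and never calls it.

# Minimal reproduction sketch; the rows below are hypothetical, not the course's people.json.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.createDataFrame([Row(name='Andy', age=30),
                            Row(name='Justin', age=19)])

df.describe()         # already runs a job in Spark 2.3.x; this is the call that raised the error above
df.describe().show()  # note the parentheses; .show without () does not call the method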
I have the following versions of Java, Scala, Python and Spark installed.
pathirippilly@sparkBox:/usr/lib/jvm$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)
pathirippilly@sparkBox:/usr/lib/jvm$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Python: 3.6.5
Spark: spark-2.3.1-bin-hadoop2.7
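For what it is worth, an IllegalArgumentException thrown from org.apache.xbean.asm5.ClassReader inside ClosureCleaner, together with the java.base/jdk.internal frames at the bottom of the trace, typically indicates that the JVM running Spark is newer than Java 8, which Spark 2.3.x does not support in practice. One way to check which Java and Python the running session actually uses (rather than what java -version prints in a separate shell) is the sketch below; it goes through the internal _jvm gateway attribute, so treat it as a diagnostic hack rather than a public API.

import sys
from pyspark.sql import SparkSession

# Reuses the already running session if there is one.
spark = SparkSession.builder.getOrCreate()

# Ask the JVM that backs this session which Java it runs on.
# _jvm is an internal py4j handle, not a stable public API.
jvm = spark.sparkContext._jvm
print("JVM java.version:", jvm.java.lang.System.getProperty("java.version"))  # expect 1.8.0_x for Spark 2.3.x
print("JVM java.home   :", jvm.java.lang.System.getProperty("java.home"))
print("Driver Python   :", sys.version.split()[0])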
My environment variables are set up as follows. I have put all of them in /etc/environment and source that file from /etc/bash.bashrc:
JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
PYSPARK_DRIVER_OPTION="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"
PYSPARK_PYTHON=python3
SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7/'
PATH=$SPARK_HOME:$PATH
PYTHONPATH=$SPARK_HOME/python/
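One thing worth checking: values in /etc/environment are only applied at login (or when a shell explicitly sources the file), so a Jupyter kernel may not inherit them at all. Also, the variable PySpark actually reads is PYSPARK_DRIVER_PYTHON; the PYSPARK_DRIVER_OPTION set above does not look like a variable Spark consults. A quick sanity check from inside the notebook, using only the standard library:

import os

# Print what the Jupyter kernel actually inherited; if these differ from /etc/environment,
# the notebook was started without that file being applied.
for name in ("JAVA_HOME", "SPARK_HOME", "PYTHONPATH",
             "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(name, "=", os.environ.get(name))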
I have not yet configured spark-env.sh. Do I need to?
Is this a compatibility issue, or am I doing something wrong here?
It would be very helpful if someone could point me in the right direction.
Note: df.show() works perfectly.
Answer 0 (score: 3)
This is resolved for me now. I reconfigured the whole setup from scratch. I prepared my /etc/environment file as follows:
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/$
export SPARK_HOME='/home/pathirippilly/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
And I added the following line to /etc/bash.bashrc:
source /etc/environment
Notes:
* pyspark is available on my PYTHONPATH, so every time I open a session in my terminal, /etc/bash.bashrc sources /etc/environment, which exports all of the environment variables above.
* I used java-1.8.0-openjdk-amd64 instead of Java 10 or 11. I thought 10 or 11 should also work according to the PySpark 2.3.1 documentation, but I am not sure.
* I used only Scala 2.11.12.
* The py4j module is also available on my PYTHONPATH.

I am not sure where I messed things up before, but with the above setup my PySpark 2.3.1 now runs fine on Java 1.8, Scala 2.11.12 and Python 3.6.5 (and without the findspark module).
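To make the "without findspark" part concrete: pyspark becomes directly importable once $SPARK_HOME/python is on PYTHONPATH and py4j is importable as well (either the bundled $SPARK_HOME/python/lib/py4j-0.10.7-src.zip added to PYTHONPATH, or py4j installed separately, as noted above). A minimal sketch, assuming the environment above is in effect:

# No findspark.init() needed: pyspark and py4j are already importable via PYTHONPATH.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Basics").getOrCreate()
print(spark.version)               # expected: 2.3.1
spark.range(5).describe().show()   # the call that failed under Java 10/11 now works on Java 8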
Answer 1 (score: 1)
OP, I had the exact same setup as you; we are actually following the same Spark course on Udemy (I set up everything to the letter of what they said) and ran into the same error in the same place. The only thing I changed was the Java version. When I did the course,
$ sudo apt-get install default-jre
installed Java 8, but now it installs 11. So I uninstalled that Java and ran
$ sudo apt-get install openjdk-8-jre
then changed the JAVA_HOME path to point to it, and now it works.
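After installing openjdk-8 next to a newer JDK, both can coexist, and JAVA_HOME plus PATH decide which one Spark launches. A small standard-library sketch to confirm what the notebook environment will actually pick up (just a diagnostic, nothing Spark-specific):

import os
import shutil
import subprocess

print("JAVA_HOME    :", os.environ.get("JAVA_HOME"))
print("java on PATH :", shutil.which("java"))

# `java -version` prints to stderr, hence stderr is redirected into stdout.
result = subprocess.run(["java", "-version"], stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, universal_newlines=True)
print(result.stdout)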
Answer 2 (score: 0)
I ran into the same error while doing the same Spark course on Udemy. Below are the steps that fixed it.
Remove OpenJDK version 11:
1) sudo apt-get autoremove default-jdk openjdk-11-jdk (it will ask for confirmation; confirm it)
2) sudo apt-get remove default-jre

Install JDK 8 and configure it:
3) sudo apt-get install openjdk-8-jre

Point JAVA_HOME to the newly installed JDK:
4) export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-amd64"
After the above steps, the error in df.describe() was resolved.