I am working on an SSH server, where I load Spark with the following command:
module load spark/2.3.0
I want to create a Hive table so I can save DataFrame partitions into it. My code, mycode.py, is as follows:
from os.path import abspath

from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext

if __name__ == "__main__":
    warehouse_location = abspath('spark-warehouse')
    appName = "mycode"  # placeholder application name
    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName(appName)
            .set("spark.default.parallelism", 128)
            .set("spark.sql.shuffle.partitions", 128)
            )
    # Enable Hive support and point the warehouse at a local directory
    spark = (SparkSession.builder
             .config(conf=conf)
             .config("spark.sql.warehouse.dir", warehouse_location)
             .enableHiveSupport()
             .getOrCreate())
    sc = spark.sparkContext
    sqlContext = SQLContext(sparkContext=sc)
    sc.stop()
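For context, once the session starts, the partitioned write I am ultimately aiming for would look roughly like this (a sketch; the DataFrame contents and table name are placeholders, not part of the failing script):
# Hypothetical DataFrame, just to illustrate the intended write
df = spark.createDataFrame([(2020, 1, "a"), (2020, 2, "b")],
                           ["year", "month", "value"])
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")   # partition columns become directories
   .saveAsTable("my_partitioned_table"))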
Running mycode.py produces the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o41.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1064)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:141)
at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:140)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:140)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059)
... 16 more
How can I fix this? Where is my mistake? Note that I run the code above with spark-submit mycode.py; I don't know whether I need to pass any additional parameters to that command.
Answer 0 (score: 0)
In my case, this happened because Spark was missing the Hive dependencies. What I did was add the jars to the PySpark dependencies:
import os

# Request the Hive support jars; this must be set before the JVM starts
submit_args = '--packages org.apache.spark:spark-hive_2.11:2.4.6 pyspark-shell'
if 'PYSPARK_SUBMIT_ARGS' not in os.environ:
    os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args
else:
    os.environ['PYSPARK_SUBMIT_ARGS'] += ' ' + submit_args  # keep a separating space
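Note that this has to run before the SparkSession is created, since the extra jars are resolved when the JVM starts. Equivalently, since the script is launched with spark-submit, the same package can be requested directly on the command line (assuming the same artifact and version):
spark-submit --packages org.apache.spark:spark-hive_2.11:2.4.6 mycode.py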