Why can't I instantiate 'org.apache.spark.sql.hive.HiveSessionStateBuilder'?

Asked: 2019-07-17 06:36:19

Tags: apache-spark hive

I'm working on an SSH server, where I load Spark with the following command:

module load spark/2.3.0

I want to create a Hive table so that I can save DataFrame partitions into it.
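For context, the kind of write I'm aiming for looks roughly like this (a minimal sketch; the DataFrame df, the database/table name, and the partition column are placeholders, not from my actual job):

# Hypothetical target: persist a DataFrame into a partitioned Hive table
df.write \
    .mode("overwrite") \
    .partitionBy("year") \
    .saveAsTable("mydb.my_partitioned_table")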

My code, mycode.py, is as follows:

from os.path import abspath

from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext

if __name__ == "__main__":
    warehouse_location = abspath('spark-warehouse')

    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName("mycode")  # appName was undefined in the original; placeholder name
            .set("spark.default.parallelism", 128)
            .set("spark.sql.shuffle.partitions", 128))

    # enableHiveSupport() is what triggers loading HiveSessionStateBuilder
    spark = (SparkSession.builder
             .config(conf=conf)
             .config("spark.sql.warehouse.dir", warehouse_location)
             .enableHiveSupport()
             .getOrCreate())
    sc = spark.sparkContext
    sqlContext = SQLContext(sparkContext=sc)
    sc.stop()

This code raises the following exception:

    py4j.protocol.Py4JJavaError: An error occurred while calling o41.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1064)
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:141)
        at org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:140)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:140)
        at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:137)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveSessionStateBuilder
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:235)
        at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059)
        ... 16 more

How can I fix this? Where is my mistake? Note that I run the code above with spark-submit mycode.py; I don't know whether I need to pass any extra arguments to that command.

1 Answer:

Answer 0 (score: 0)

In my case, this happened because Spark was missing the Hive dependencies.

What I did was add the Hive JARs to the PySpark dependencies:

import os

# Pull the Hive integration (and its transitive dependencies) from Maven at launch.
# PYSPARK_SUBMIT_ARGS must be set before the SparkSession (and its JVM) is created.
submit_args = '--packages org.apache.spark:spark-hive_2.11:2.4.6 pyspark-shell'
if 'PYSPARK_SUBMIT_ARGS' not in os.environ:
    os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args
else:
    os.environ['PYSPARK_SUBMIT_ARGS'] += ' ' + submit_args
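Since the question runs the script with spark-submit, the equivalent fix (my extrapolation of the same idea, not spelled out in the original answer) is to pass the package on the spark-submit command line, with the version matched to the installed Spark (2.3.0 in the question):

spark-submit --packages org.apache.spark:spark-hive_2.11:2.3.0 mycode.py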