Apache Spark 2.3.1 with Hive Metastore 3.1.0

Asked: 2018-10-26 14:19:29

Tags: apache-spark hive apache-spark-sql hive-metastore hdp

We upgraded our HDP cluster to 3.1.1.3.0.1.0-187 and found that:

  1. Hive has a new metastore location
  2. Spark can't see Hive databases

In fact we see:

org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database ... not found

Could you please help me understand what has happened and how to solve this?
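For reference, a minimal way to reproduce the symptom from an existing session (the database name below is just a placeholder):

// Databases that the SparkSession resolves through its catalog;
// after the upgrade only "default" shows up here.
spark.sql("SHOW DATABASES").show(false)

// Resolving an existing Hive database by name fails with NoSuchDatabaseException.
spark.sql("USE some_hive_db")   // placeholder database name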

UPDATE

Configuration:

  

(spark.sql.warehouse.dir,/warehouse/tablespace/external/hive/)
(spark.admin.acls,)
(spark.yarn.dist.files,file:///opt/folder/config.yml,file:///opt/jdk1.8.0_172/jre/lib/security/cacerts)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.service.keytab)
(spark.io.compression.lz4.blockSize,128kb)
(spark.executor.extraJavaOptions,-Djavax.net.ssl.trustStore=cacerts)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.io.encryption.keygen.algorithm,HmacSHA1)
(spark.sql.autoBroadcastJoinThreshold,26214400)
(spark.eventLog.enabled,true)
(spark.shuffle.service.enabled,true)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.ssl.keyStore,/etc/security/serverKeys/server-keystore.jks)
(spark.yarn.queue,default)
(spark.jars,file:/opt/folder/component-assembly-0.1.0-SNAPSHOT.jar)
(spark.ssl.enabled,true)
(spark.sql.orc.filterPushdown,true)
(spark.shuffle.unsafe.file.output.buffer,5m)
(spark.yarn.historyServer.address,master2.env.project:18481)
(spark.ssl.trustStore,/etc/security/clientKeys/all.jks)
(spark.app.name,com.company.env.component.MyClass)
(spark.sql.hive.metastore.jars,/usr/hdp/current/spark2-client/standalone-metastore/*)
(spark.io.encryption.keySizeBits,128)
(spark.driver.memory,2g)
(spark.executor.instances,10)
(spark.history.kerberos.principal,spark/edge.env.project@ENV.PROJECT)
(spark.unsafe.sorter.spill.reader.buffer.size,1m)
(spark.ssl.keyPassword,*********(redacted))
(spark.ssl.keyStorePassword,*********(redacted))
(spark.history.fs.cleaner.enabled,true)
(spark.shuffle.io.serverThreads,128)
(spark.sql.hive.convertMetastoreOrc,true)
(spark.submit.deployMode,client)
(spark.sql.orc.char.enabled,true)
(spark.master,yarn)
(spark.authenticate.enableSaslEncryption,true)
(spark.history.fs.cleaner.interval,7d)
(spark.authenticate,true)
(spark.history.fs.cleaner.maxAge,90d)
(spark.history.ui.acls.enable,true)
(spark.acls.enable,true)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.executor.memory,2g)
(spark.io.encryption.enabled,true)
(spark.shuffle.file.buffer,1m)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.ssl.protocol,TLS)
(spark.dynamicAllocation.enabled,true)
(spark.executor.cores,3)
(spark.history.ui.port,18081)
(spark.sql.statistics.fallBackToHdfs,true)
(spark.repl.local.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
(spark.ssl.trustStorePassword,*********(redacted))
(spark.history.ui.admin.acls,)
(spark.history.kerberos.enabled,true)
(spark.shuffle.io.backLog,8192)
(spark.sql.orc.impl,native)
(spark.ssl.enabledAlgorithms,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)
(spark.sql.orc.enabled,true)
(spark.yarn.dist.jars,file:///opt/folder/postgresql-42.2.2.jar,file:///opt/folder/ojdbc6.jar)
(spark.sql.hive.metastore.version,3.0)

From hive-site.xml:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/warehouse/tablespace/managed/hive</value>
</property>

The code is as follows:

val spark = SparkSession
  .builder()
  .appName(getClass.getSimpleName)
  .enableHiveSupport()
  .getOrCreate()
...
dataFrame.write
  .format("orc")
  .options(Map("spark.sql.hive.convertMetastoreOrc" -> true.toString))
  .mode(SaveMode.Append)
  .saveAsTable("name")

spark-submit:

    --master yarn \
    --deploy-mode client \
    --driver-memory 2g \
    --driver-cores 4 \
    --executor-memory 2g \
    --num-executors 10 \
    --executor-cores 3 \
    --conf "spark.dynamicAllocation.enabled=true" \
    --conf "spark.shuffle.service.enabled=true" \
    --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=cacerts" \
    --conf "spark.sql.warehouse.dir=/warehouse/tablespace/external/hive/" \
    --jars postgresql-42.2.2.jar,ojdbc6.jar \
    --files config.yml,/opt/jdk1.8.0_172/jre/lib/security/cacerts \
    --verbose \
    component-assembly-0.1.0-SNAPSHOT.jar \

2 Answers:

Answer 0 (score: 5)

It looks like this is a Spark feature that has not been implemented yet. The only way I have found to work with Spark and Hive from Hive 3.0 onwards is to use the HiveWarehouseConnector from Hortonworks. Documentation here. A good guide from the Hortonworks community here. I am leaving the question unanswered until the Spark developers ship a solution of their own.
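For orientation, a rough sketch of what the HiveWarehouseConnector usage looks like, based on the Hortonworks documentation linked above; it needs the hive-warehouse-connector-assembly jar on the classpath, the exact API can differ between HWC versions, and the table names below are placeholders:

// Requires spark.sql.hive.hiveserver2.jdbc.url (and related HWC settings)
// to point at HiveServer2 Interactive.
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Read through Hive rather than Spark's built-in catalog.
val df = hive.executeQuery("SELECT * FROM some_db.some_table")

// Write back through the connector.
df.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "some_db.some_table")
  .save()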

Answer 1 (score: 0)

Despite the disclaimers, I do have a somewhat hacky workaround for this. It bypasses Ranger permissions, so don't blame me if it incurs the wrath of your admin.

Use with spark-shell:

export HIVE_CONF_DIR=/usr/hdp/current/hive-client/conf
spark-shell --conf "spark.driver.extraClassPath=/usr/hdp/current/hive-client/conf"
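Once the shell comes up with that classpath, a quick sanity check (illustrative; table/database names are placeholders):

// If the workaround takes effect, databases from the new Hive metastore
// should be listed here instead of just "default".
spark.sql("SHOW DATABASES").show(false)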

Use with sparklyr:

Sys.setenv(HIVE_CONF_DIR="/usr/hdp/current/hive-client/conf")
conf = spark_config()
conf$'sparklyr.shell.driver-class-path' = '/usr/hdp/current/hive-client/conf'

It should also work with the Thrift server, but I haven't tested that yet.