Question

A Hive table created through Spark (pyspark) cannot be accessed from Hive.

df.write.format("orc").mode("overwrite").saveAsTable("db.table")

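Error when accessing the table from Hive:

Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)

The table is created successfully in Hive, and I can read it back in Spark. The table metadata is accessible (in Hive), and so are the table's data files (in HDFS).

The TBLPROPERTIES of the Hive table are: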

  'bucketing_version'='2',                         
  'spark.sql.create.version'='2.3.1.3.0.0.0-1634', 
  'spark.sql.sources.provider'='orc',              
  'spark.sql.sources.schema.numParts'='1',

I also tried creating the table with other workarounds, but those fail at table creation:

df.write.mode("overwrite").saveAsTable("db.table")

OR

df.createOrReplaceTempView("dfTable")
spark.sql("CREATE TABLE db.table AS SELECT * FROM dfTable")

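Error:

AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Table default.src failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.);'

Stack version details:

Spark 2.3

Hive 3.1

Hortonworks Data Platform HDP 3.0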

Answer 1

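From HDP 3.0, the catalogs for Apache Hive and Apache Spark are separated, and each uses its own catalog. In other words, they are mutually exclusive: the Apache Hive catalog can only be accessed by Apache Hive or by this library (the Hive Warehouse Connector), and the Apache Spark catalog can only be accessed by the existing APIs in Apache Spark. As a result, some features, such as ACID tables or Apache Ranger with Apache Hive tables, are only available in Apache Spark through this library. Those Hive tables cannot be accessed directly through the Apache Spark APIs themselves.

The following article explains the steps: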

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector
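
As a rough sketch of what going through the connector looks like from pyspark: the snippet below assumes the Spark session was launched with the Hive Warehouse Connector jar, its pyspark_llap Python package, and the connection settings described in the article above. The helper names write_via_hwc and read_via_hwc are made up for illustration, and no save mode is set because support may differ between connector versions.

```python
# Data source name of the Hive Warehouse Connector on HDP 3.x.
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"


def write_via_hwc(df, table_name):
    """Write a Spark DataFrame to a Hive table through the connector.

    Hypothetical helper: used in place of df.write...saveAsTable(...),
    which only registers the table in the Spark catalog.
    """
    (df.write
       .format(HWC_FORMAT)
       .option("table", table_name)
       .save())


def read_via_hwc(spark, query):
    """Run a query against the Hive catalog and return a DataFrame.

    Hypothetical helper. pyspark_llap ships with the connector, so it is
    imported lazily and only needed when this function is called.
    """
    from pyspark_llap import HiveWarehouseSession

    hive = HiveWarehouseSession.session(spark).build()
    return hive.executeQuery(query)
```

With these in place, write_via_hwc(df, "db.table") would stand in for the saveAsTable call from the question, and read_via_hwc(spark, "SELECT * FROM db.table") would read the table back through the Hive catalog.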

Answer 2

I faced the same issue. After setting the following properties, it works fine.

set hive.mapred.mode=nonstrict;
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;
set hive.tez.bucket.pruning=true;
set hive.explain.user=false; 
set hive.fetch.task.conversion=none;
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
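
If you query Hive from Python rather than from beeline, the same session settings can be applied before the query. This is only a sketch: it assumes a DB-API style cursor connected to HiveServer2 (for example one from PyHive, which is not mentioned in the original answer), and apply_hive_settings is a made-up helper name. The trailing semicolons are dropped because each statement is sent on its own.

```python
# Session properties copied from the list above.
HIVE_SESSION_SETTINGS = {
    "hive.mapred.mode": "nonstrict",
    "hive.optimize.ppd": "true",
    "hive.optimize.index.filter": "true",
    "hive.tez.bucket.pruning": "true",
    "hive.explain.user": "false",
    "hive.fetch.task.conversion": "none",
    "hive.support.concurrency": "true",
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
}


def apply_hive_settings(cursor, settings=HIVE_SESSION_SETTINGS):
    """Execute one SET statement per property on an open Hive cursor.

    Hypothetical helper; returns the statements that were sent.
    """
    statements = []
    for key, value in settings.items():
        statement = "SET {}={}".format(key, value)
        cursor.execute(statement)
        statements.append(statement)
    return statements


# Example usage (assumption: PyHive is installed, host name is a placeholder):
# from pyhive import hive
# cursor = hive.connect(host="hiveserver2-host", port=10000).cursor()
# apply_hive_settings(cursor)
# cursor.execute("SELECT * FROM db.table LIMIT 10")
```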
