Question

I am trying to run a simple Hive query from PySpark, but it throws an error. I need some help. Here is the code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Termination_Calls Snapshot").config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").enableHiveSupport().getOrCreate()
x_df = spark.sql("SELECT count(*) as RC from bi_schema.table_a")

This raises the following error:

Hive Session ID = a00fe842-7099-4130-ada2-ee4ae75764be
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 716, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o70.sql.
: java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:214)

When I run the same query in Hive, I get the expected result, as shown below.

+-------------+
|     rc      |
+-------------+
| 3037579538  |
+-------------+
1 row selected (25.469 seconds)

Answer 1

This is a bug in Spark that is specific to the ORC format.

Setting the following property in the Spark configuration works around the problem:

spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
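
The same property can presumably also be supplied when the session is built, next to the Hive options from the question. A minimal sketch under that assumption follows; `ORC_WORKAROUND_CONF` and `build_session` are hypothetical names introduced for illustration, and only the property keys and values come from this thread.

```python
# Hypothetical names; only the property keys and values come from the thread.
ORC_WORKAROUND_CONF = {
    "hive.exec.dynamic.partition": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    # Read metastore ORC tables through the Hive SerDe instead of
    # Spark's built-in ORC reader.
    "spark.sql.hive.convertMetastoreOrc": "false",
}


def build_session(app_name, conf=ORC_WORKAROUND_CONF):
    """Create a Hive-enabled SparkSession with every entry of conf applied."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Usage (needs a working Spark + Hive setup):
#   spark = build_session("Termination_Calls Snapshot")
#   spark.sql("SELECT count(*) AS RC FROM bi_schema.table_a").show()
```

Keeping the settings in one dictionary makes it easy to see which workaround flags are active for a given job.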

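If we look closely at the Spark code in HiveMetastoreCatalog, the following assertion is the one that fails:

assert(result.output.length == relation.output.length && result.output.zip(relation.output).forall { case (a1, a2) => a1.dataType == a2.dataType })

In other words, it checks that the number of columns and their data types match. One possible cause is that the metastore was not updated after an ALTER TABLE, but that is unlikely.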
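
To make the condition concrete, here is a plain-Python sketch of the same comparison. It is not Spark code: `schemas_match` is a hypothetical helper, and the schemas are made-up lists of (column name, data type) pairs used only for illustration.

```python
def schemas_match(result_output, relation_output):
    """Mirror the assertion: same column count, same data type at each position.

    Each argument is a list of (column_name, data_type) pairs. Column names
    are ignored, as in the Scala assertion.
    """
    return len(result_output) == len(relation_output) and all(
        a_type == b_type
        for (_, a_type), (_, b_type) in zip(result_output, relation_output)
    )


# Made-up schemas for illustration only.
metastore_schema = [("id", "bigint"), ("call_ts", "timestamp")]
same_schema = [("id", "bigint"), ("call_ts", "timestamp")]
extra_column = same_schema + [("dummy", "string")]
changed_type = [("id", "int"), ("call_ts", "timestamp")]

print(schemas_match(same_schema, metastore_schema))   # True
print(schemas_match(extra_column, metastore_schema))  # False: column counts differ
print(schemas_match(changed_type, metastore_schema))  # False: bigint vs int
```

An extra column or a changed type on either side is enough to make the check fail, which is the situation the assertion guards against.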

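I then thought about filing a JIRA ticket for it, but it turns out the ORC format has always had some problems. There are already two JIRA tickets about this issue.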

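If we leave spark.sql.hive.convertMetastoreOrc at its default value of true, Spark uses the vectorized reader (see the official doc). Because of this bug, the column counts do not match and the assertion fails. I suspect this property causes some dummy columns to be added when the vectorized reader is used.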

Answer 2

Could you try the following steps once? I don't think we can query the Hive table directly using HiveContext.

from pyspark.sql import HiveContext
# sc is assumed to be an existing SparkContext, for example the one the pyspark shell provides
hive_context = HiveContext(sc)
result = hive_context.table("bi_schema.table_a")

After fetching the table this way, we need to register the resulting DataFrame as a temporary table, as shown below:

result.registerTempTable("table_a")

Now we can run a SELECT statement against that table, as shown below:

x_df = hive_context.sql("SELECT count(*) as RC from table_a")

Unable to get a count from a Hive table when trying from PySpark
