Hive table is read multiple times when using spark.sql and union

Time: 2019-04-23 15:10:05

Tags: apache-spark pyspark

I have a single Hive table that is used in multiple subsequent spark.sql queries.

Each stage shows a HiveTableScan, which is unnecessary because the table only needs to be read once.

How can I avoid this?

Here is a simplified example that reproduces the problem.

Create an example table:-

spark.sql("CREATE DATABASE IF NOT EXISTS default")
spark.sql("DROP TABLE IF EXISTS default.data")
spark.sql("CREATE TABLE IF NOT EXISTS default.data(value INT)")
spark.sql("INSERT OVERWRITE TABLE default.data VALUES(1)")

Run multiple queries that build on the previous dataframe:-

query1 = spark.sql("select value from default.data")
query1.createOrReplaceTempView("query1")

query2 = spark.sql("select max(value)+1 as value from query1").union(query1)
query2.createOrReplaceTempView("query2")

query3 = spark.sql("select max(value)+1 as value from query2").union(query2)
query3.createOrReplaceTempView("query3")

spark.sql("select value from query3").show()

Expected output is:-

+-----+
|value|
+-----+
|    3|
|    2|
|    1|
+-----+

1 answer:

Answer 0 (score: 0)

Edited

Can you use cacheTable(String tableName)?

Try:

query1 = spark.sql("select value from default.data")
query1.createOrReplaceTempView("query1")

# In PySpark, cacheTable is reached through spark.catalog
spark.catalog.cacheTable("query1")

query2 = spark.sql("select max(value)+1 as value from query1").union(query1)
query2.createOrReplaceTempView("query2")

spark.catalog.cacheTable("query2")

query3 = spark.sql("select max(value)+1 as value from query2").union(query2)
query3.createOrReplaceTempView("query3")

spark.catalog.cacheTable("query3")

spark.sql("select value from query3").show()

With this, Spark SQL caches the tables using an in-memory columnar format to minimize memory usage. You can then remove the tables from the cache with uncacheTable(), as follows:

spark.catalog.uncacheTable("query1")
spark.catalog.uncacheTable("query2")
spark.catalog.uncacheTable("query3")