I have a simple piece of Spark code:
import org.apache.spark.sql.SparkSession

test("SparkTest 0462") {
  val spark = SparkSession.builder()
    .master("local")
    .appName("SparkTest0462")
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._

  val data1 = Seq((1, 2), (1, 7), (3, 6), (5, 4), (1, 10), (6, 7), (2, 5))
  spark.sql("create table if not exists SparkTest0462_1(a int, b int) stored as textfile")
  spark.sql("create table if not exists SparkTest0462_2(a int, b int) stored as textfile")
  spark.createDataset(data1).toDF("a", "b").createOrReplaceTempView("x")

  val df = spark.sql(
    """
      |from (select a, b from x)
      |insert overwrite table SparkTest0462_1 select a, b
      |insert overwrite table SparkTest0462_2 select a, b
    """.stripMargin)
  df.explain(true)
  df.count()
}
The physical plan for this code is:
== Physical Plan ==
UnionExec
:- Execute InsertIntoHiveTable InsertIntoHiveTable `default`.`sparktest0462_1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, true, false, [a, b]
: +- LocalTableScanExec [a#7, b#8]
+- Execute InsertIntoHiveTable InsertIntoHiveTable `default`.`sparktest0462_2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, true, false, [a, b]
+- LocalTableScanExec [a#7, b#8]
Does Spark SQL run the query defined by the FROM clause once for each INSERT OVERWRITE, and then perform the insert? If the query in the FROM clause is very expensive (for example, a full outer join) and Spark SQL re-runs it for every insert, that would be very time-consuming. Hive, by contrast, appears to run the query once and then insert the result into both tables, which would save a lot of time. Am I understanding the Spark and Hive behaviors of this FROM ... INSERT OVERWRITE ... INSERT OVERWRITE ... construct correctly?
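In case it helps clarify the question, here is a workaround sketch I am considering, assuming the FROM-clause result really is recomputed per insert: materialize the shared query once with `cache()`, then run each INSERT against the cached view (the view name `x_cached` is my own, not from the original code):

```scala
import org.apache.spark.sql.SparkSession

object MultiInsertCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("MultiInsertCacheSketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Same source data as the original test.
    Seq((1, 2), (1, 7), (3, 6), (5, 4), (1, 10), (6, 7), (2, 5))
      .toDF("a", "b")
      .createOrReplaceTempView("x")

    // Materialize the (potentially expensive) FROM-clause query once.
    // cache() is lazy: the first insert triggers the computation and
    // populates the cache; the second insert reads from the cache.
    spark.sql("select a, b from x").cache().createOrReplaceTempView("x_cached")

    // Each insert now scans the cached result instead of re-running the query.
    spark.sql("insert overwrite table SparkTest0462_1 select a, b from x_cached")
    spark.sql("insert overwrite table SparkTest0462_2 select a, b from x_cached")

    spark.stop()
  }
}
```

But I would rather understand whether the single multi-insert statement already avoids the recomputation, as it seems to in Hive.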