I have around 20-30 queries running on Hive on CDH 5.13. To evaluate the performance of Spark on the same queries, I wrote code that reads the queries from a file and executes them on Spark using HiveContext.
My Python code to run the queries looks like this:
All of the queries take the same general form, a CREATE TABLE ... AS SELECT (shown under Q-3 below).
# step-0: imports needed by the driver code below
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# step-1: readQueryFromFile is a custom function that reads the query file
# uploaded via spark-submit (accessible to this driver code), substitutes the
# parameters in dict, and returns the final query, which is printed here
finalQuery = readQueryFromFile(sqlFile, dict)
print(finalQuery)

# step-2: prepare the contexts
sparkCtx = SparkContext.getOrCreate(SparkConf())
hc = HiveContext(sparkCtx)

# step-3: run the sql; note that hc.sql() returns a DataFrame, not an RDD
df = hc.sql(finalQuery)

# step-4: call an explicit action
df.show()
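For reference, a minimal sketch of a spark-submit invocation for a script like this, shipping the query file alongside the job with --files (the script name, query file name, and resource sizes are placeholders, not the exact command used):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --files queries.sql \
  --num-executors 10 \
  --executor-memory 8g \
  run_queries.py queries.sql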
Given this setup, here are my questions:
Q-1: Since these queries do not return data to a DataFrame on the driver but instead create tables, I assume all of the work happens on the executor nodes and I do not need to worry about driver/master memory. Am I correct in this assumption?
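To make Q-1 concrete, this is the distinction I am relying on, as a minimal sketch (the table name is hypothetical): show() ships only a handful of rows to the driver, whereas collect() materializes the entire result in driver memory.

df = hc.sql("select * from some_large_table")  # hypothetical table
df.show(20)          # only the first 20 rows reach the driver
rows = df.collect()  # the entire result is pulled into driver memory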
Q-2: Do I really need to call an action in step 4 to trigger the query?
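To make Q-2 concrete, a minimal sketch of the two cases I am unsure about (the target table name tmp_copy is hypothetical):

df = hc.sql("select * from stg_test1")  # transformations only -- nothing runs yet
df.show()                               # explicit action that triggers execution

hc.sql("create table tmp_copy stored as parquet as select * from stg_test1")
# is an explicit action still required after a create-table statement like this?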
Q-3: The query below, which creates a table using left joins, fails with an out-of-memory error, and I cannot increase memory any further. If something does not fit in memory, Spark should spill it to disk, right? If so, are there any hints or settings that would help the operation succeed?
Note: This query runs fine on Hive. All of the CREATE TABLE queries follow this general format:
create table [tablename] stored as parquet as select * from stg_[tablename]
And here is the full query that I run through spark-submit:
create table test_final stored as parquet as
select * from stg_test1 w
left join (
select * from test1
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) l
on w.k1 = l.k1 and w.k2 = l.k2 and w.k3 = l.k3
left join (
select * from test2
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) e
on w.k1 = e.k1 and w.k2 = e.k2 and l.k4 = e.k4
left join (
select * from test3
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) c
on w.k1 = c.k1 and w.k2 = c.k2 and w.k5 = c.k5
left join (
select * from test4
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) j
on w.k1 = j.k1 and w.k2 = j.k2 and w.k6 = j.k6
left join (
select * from test5
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) o
on w.k1 = o.k1 and w.k2 = o.k2 and w.k7 = o.k7
left join (
select * from test6
where strt_dt <= '${hiveconf:mon_lst_d}' or strt_dt is null
) s
on w.k1 = s.k1 and w.k2 = s.k2 and w.k8 = s.k8 and w.k9 = s.k9 and w.k10 = s.k10
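For Q-3, a minimal sketch of the spark-submit settings that appear relevant to join memory and spilling; the values are placeholders, not a known-good configuration. Raising spark.sql.shuffle.partitions produces more, smaller shuffle partitions per join (less memory per task), and setting spark.sql.autoBroadcastJoinThreshold to -1 disables broadcast joins so that none of the joined tables has to fit in an executor's memory as a whole:

spark-submit \
  --master yarn \
  --executor-memory 8g \
  --executor-cores 2 \
  --conf spark.sql.shuffle.partitions=800 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  run_queries.py queries.sql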