There is something here I don't understand, and I hope someone can help. How do we run the TPC-H queries on Spark? I am looking at the queries on GitHub, specifically query number 21, which is split into three parts as shown below.
I don't understand how to execute this. Should we run the three parts separately? I did that, but I get no results back: the results are stored in Hive tables, but nothing is returned to Spark.
Example (you can see the three parts of the query):
insert overwrite table q21_tmp1
select l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey
from lineitem
group by l_orderkey;
insert overwrite table q21_tmp2
select l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey
from lineitem
where l_receiptdate > l_commitdate
group by l_orderkey;
insert overwrite table q21_suppliers_who_kept_orders_waiting
select s_name, count(1) as numwait
from (
  select s_name
  from (
    select s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
    from q21_tmp2 t2
    right outer join (
      select s_name, l_orderkey, l_suppkey
      from (
        select s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
        from q21_tmp1 t1
        join (
          select s_name, l_orderkey, l_suppkey
          from orders o
          join (
            select s_name, l_orderkey, l_suppkey
            from nation n
            join supplier s on s.s_nationkey = n.n_nationkey
              and n.n_name = 'SAUDI ARABIA'
            join lineitem l on s.s_suppkey = l.l_suppkey
            where l.l_receiptdate > l.l_commitdate
          ) l1 on o.o_orderkey = l1.l_orderkey and o.o_orderstatus = 'F'
        ) l2 on l2.l_orderkey = t1.l_orderkey
      ) a
      where (count_suppkey > 1) or ((count_suppkey = 1) and (l_suppkey <> max_suppkey))
    ) l3 on l3.l_orderkey = t2.l_orderkey
  ) b
  where (count_suppkey is null) or ((count_suppkey = 1) and (l_suppkey = max_suppkey))
) c
group by s_name
order by numwait desc, s_name
limit 100;
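The three parts above are meant to be submitted one at a time, in order: the two temporary tables must be populated before the final statement that reads them. Spark SQL's `sql()` method takes a single statement per call, so a multi-statement script like this one has to be split on `;` first. A minimal sketch, assuming a Hive-enabled `SQLContext` such as the one `spark-shell` provides; `RunTpchParts` and `splitStatements` are illustrative names, not part of any Spark API:

```scala
object RunTpchParts {
  // Split a multi-statement SQL script into individual statements,
  // trimming whitespace and dropping empty fragments.
  def splitStatements(script: String): Seq[String] =
    script.split(";").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Abbreviated stand-in for the real three-part q21 script.
    val script = """
      insert overwrite table q21_tmp1 select 1;
      insert overwrite table q21_tmp2 select 2;
      insert overwrite table q21_suppliers_who_kept_orders_waiting select 3;
    """
    val parts = splitStatements(script)
    println(parts.size) // 3 statements, to be run in order
    // In a Spark shell each part would then be submitted separately:
    // parts.foreach(stmt => sqlContext.sql(stmt))
  }
}
```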
Answer 0 (score: -1)
I am not sure I understood you correctly, but once the data has been inserted into the Hive tables that your Spark environment is connected to, you can access it through the Spark SQL API:
val result = sqlContext.sql("SELECT * FROM q21_tmp1")
result.show()
or
val df = sqlContext.table("q21_tmp1")
val result = df.select("*")
result.show()
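One detail that may explain the behaviour described in the question: an `INSERT OVERWRITE` writes into the Hive table but returns no rows to the driver, so running the three parts will appear to produce nothing. The final answer has to be read back explicitly; a sketch, using the result table and column names from the query above:

```scala
// The INSERT itself yields an empty result; query the target table afterwards.
val result = sqlContext.sql(
  "SELECT s_name, numwait FROM q21_suppliers_who_kept_orders_waiting " +
  "ORDER BY numwait DESC, s_name LIMIT 100")
result.show(100)
```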
For more details, see the Spark Documentation.