Question

I have a query that joins two tables. How can I optimize it so that it runs faster?

val q = """
          | select a.value as viewedid,b.other as otherids
          | from bm.distinct_viewed_2610 a, bm.tets_2610 b
          | where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
          |   select value from bm.distinct_viewed_2610)
          |""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)

Table descriptions:

hive> desc distinct_viewed_2610;
OK
value                   string

hive> desc tets_2610;
OK
id                      int                                         
other                   string

The data looks like this:

hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989

and

hive> select * from tets_2610 limit 2;
OK

1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197

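The distinct_viewed_2610 table has 1.1 million records. I am trying to find matching ids in tets_2610, which has 200,000 rows, by splitting its second column.

For 100,000 records, the job takes 8.5 hours on two machines: one with 16 GB of RAM and 16 cores, the other with 8 GB of RAM and 8 cores.

Is there a way to optimize the query?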

spark-executor pic

Answer 1

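Right now you are doing a Cartesian join. A Cartesian join gives you 1.1M * 200K = 220 billion rows. After the Cartesian join, the rows are filtered by where FIND_IN_SET(a.value, b.other) != 0.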
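To get a feel for the size of that cross product, here is a small plain-Scala sketch (no Spark involved). The row counts are the ones quoted in the question; `findInSet` is a hypothetical helper that only approximates Hive's FIND_IN_SET for non-null inputs.

```scala
object CartesianCost {
  // Row counts quoted in the question.
  val viewedRows: Long = 1100000L
  val otherRows: Long  = 200000L

  // Number of (a, b) pairs a cross join has to evaluate before filtering.
  val pairs: Long = viewedRows * otherRows

  // Hypothetical stand-in for Hive's FIND_IN_SET: 1-based position of
  // `needle` in a comma-separated list, 0 when it is absent.
  def findInSet(needle: String, csv: String): Int =
    if (needle.contains(",")) 0
    else csv.split(",", -1).indexOf(needle) + 1

  def main(args: Array[String]): Unit = {
    println(s"pairs to test: $pairs")
    println(findInSet("1013425393", "103973207,1013425393,1013812066"))
  }
}
```

Every one of those pairs triggers a string scan of b.other, which is why the filter cannot rescue the plan.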

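Analyze your data. If the 'other' string contains 10 elements on average, then exploding it will give you about 2M rows from table b (200K * 10). If we further suppose that only 1/10 of those rows will join, the INNER JOIN will return about 2M / 10 = 200K rows.

If these assumptions hold, exploding the array and joining will perform better than the Cartesian join plus filter.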
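The estimate can be redone for other guesses with a few lines of Scala. This is only a back-of-the-envelope sketch: the 200K row count for tets_2610 comes from the question, while the average list length and the match ratio are assumptions, and `estimate` is a hypothetical helper name.

```scala
object JoinEstimate {
  // rows:        rows in the table holding the comma-separated lists
  // avgElements: assumed average number of ids per list
  // oneIn:       assume one exploded row in `oneIn` finds a join partner
  // Returns (rows after explode, rows after the inner join).
  def estimate(rows: Long, avgElements: Long, oneIn: Long): (Long, Long) = {
    val exploded = rows * avgElements
    (exploded, exploded / oneIn)
  }

  def main(args: Array[String]): Unit = {
    val (exploded, joined) = estimate(200000L, 10L, 10L)
    println(s"exploded rows: $exploded, joined rows: $joined")
  }
}
```

Measuring the real average list length on your data (for example with size(split(other, ','))) would replace the guessed value of 10.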

select distinct a.value as viewedid, b.otherids
  from bm.distinct_viewed_2610 a
       inner join (select e.otherid, b.other as otherids 
                     from bm.tets_2610 b
                          lateral view explode (split(b.other ,',')) e as otherid
                  )b on a.value=b.otherid
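To illustrate why the rewritten query should return the same pairs as the original, here is a minimal plain-Scala sketch on made-up sample data (the ids and the names `viewed` and `others` are illustrative, not taken from the real tables). It compares the cross join plus filter with the explode plus equi-join approach on in-memory collections; it does not model Spark's execution.

```scala
object ExplodeJoinSketch {
  // Tiny made-up samples shaped like the two tables in the question.
  val viewed: Seq[String] = Seq("1013425393", "1033346511", "1014099507")
  val others: Seq[String] = Seq(
    "103973207,1013425393,1013812066,1014099507",
    "1014295173,1014432476"
  )

  // Cross join + filter: tests every (value, other) pair.
  def cartesian: Set[(String, String)] =
    (for {
      v <- viewed
      o <- others
      if o.split(",").contains(v)
    } yield (v, o)).toSet

  // Explode + equi-join: split each list once, then look ids up by key.
  def explodeJoin: Set[(String, String)] = {
    val viewedSet = viewed.toSet
    (for {
      o  <- others
      id <- o.split(",").toSeq
      if viewedSet.contains(id)
    } yield (id, o)).toSet
  }

  def main(args: Array[String]): Unit = {
    println(cartesian == explodeJoin)
    explodeJoin.foreach(println)
  }
}
```

The second version touches each list element once and uses a keyed lookup, which is the same idea that lets the engine run an equi-join instead of comparing all pairs.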

You do not need this:

and a.value in (select value from bm.distinct_viewed_2610)

Sorry, I cannot test the query myself; please test it on your side.

Answer 2

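If you are using the ORC format, consider switching to Parquet. Depending on your data, I would suggest range partitioning.

Choose a suitable degree of parallelism so that the job executes quickly.

I have answered a related question at the following link, which may help you: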

Spark doing exchange of partitions already correctly distributed

Also read:

http://dev.sortable.com/spark-repartition/

How to optimize join?
