How can I convert the following query to be compatible with Spark 1.6, which does not support subqueries:
SELECT ne.device_id, sp.device_hostname
FROM `table1` ne INNER JOIN `table2` sp
ON sp.device_hostname =
(SELECT device_hostname FROM `table2`
WHERE device_hostname LIKE
CONCAT(ne.device_id,'%') ORDER BY device_hostname DESC LIMIT 1)
I've read that it supports subqueries specified in FROM but not in WHERE, but the following doesn't work either:
SELECT * FROM (SELECT ne.device_id, sp.device_hostname
FROM `table1` ne INNER JOIN `table2` sp
ON sp.device_hostname =
(SELECT device_hostname FROM `table2`
WHERE device_hostname LIKE
CONCAT(ne.device_id,'%') ORDER BY device_hostname DESC LIMIT 1)) AS TA
My overall goal is to join the two tables but take only the last record from table2. The SQL statement is valid, but when I run it through HiveContext.sql in Spark I get an AnalysisException.
Answer:
You can use HiveContext with window functions (see How to select the first row of each group?):
scala> Seq((1L, "foo")).toDF("id", "device_id").registerTempTable("table1")
scala> Seq((1L, "foobar"), (2L, "foobaz")).toDF("id", "device_hostname").registerTempTable("table2")
scala> sqlContext.sql("""
| WITH tmp AS (
| SELECT ne.device_id, sp.device_hostname, row_number() OVER (PARTITION BY device_id ORDER BY device_hostname) AS rn
| FROM table1 ne INNER JOIN table2 sp
| ON sp.device_hostname LIKE CONCAT(ne.device_id, '%'))
| SELECT device_id, device_hostname FROM tmp WHERE rn = 1
| """).show
+---------+---------------+
|device_id|device_hostname|
+---------+---------------+
| foo| foobar|
+---------+---------------+
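Note that your original subquery takes the last hostname (ORDER BY device_hostname DESC LIMIT 1), while the window above sorts ascending and keeps the first. A minimal variation, only flipping the window ordering, matches the DESC semantics (with the toy data above it would return foobaz instead of foobar):

// Sketch against the same temp tables as above: ordering the window
// DESC keeps the greatest hostname per device_id, mirroring
// ORDER BY device_hostname DESC LIMIT 1 from the original query.
sqlContext.sql("""
  WITH tmp AS (
    SELECT ne.device_id, sp.device_hostname,
           row_number() OVER (PARTITION BY device_id
                              ORDER BY device_hostname DESC) AS rn
    FROM table1 ne INNER JOIN table2 sp
      ON sp.device_hostname LIKE CONCAT(ne.device_id, '%'))
  SELECT device_id, device_hostname FROM tmp WHERE rn = 1
""").show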
Since only two columns are required, you can aggregate instead (use max rather than min if you want the last hostname, matching your DESC ordering):
scala> sqlContext.sql("""
| WITH tmp AS (
| SELECT ne.device_id, sp.device_hostname
| FROM table1 ne INNER JOIN table2 sp
| ON sp.device_hostname LIKE CONCAT(ne.device_id, '%'))
| SELECT device_id, min(device_hostname) AS device_hostname
| FROM tmp GROUP BY device_id
|""").show
+---------+---------------+
|device_id|device_hostname|
+---------+---------------+
| foo| foobar|
+---------+---------------+
To improve performance, you should try to replace the LIKE with an equi-join condition (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?).
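A hedged sketch of that idea, assuming device_id values all share a known fixed length (3 characters in the toy data, so substr(device_hostname, 1, 3) serves as the join key; adjust for your real data): precompute the prefix on table2 and join on equality, which Spark can plan as a hash join instead of evaluating LIKE against every pair of rows.

// Assumption (not in the original answer): every device_id is exactly
// 3 characters, so the first 3 characters of device_hostname can serve
// as an equi-join key. max() keeps the last hostname, as in the
// ORDER BY ... DESC LIMIT 1 of the original query.
sqlContext.sql("""
  WITH keyed AS (
    SELECT device_hostname,
           substr(device_hostname, 1, 3) AS join_key
    FROM table2)
  SELECT ne.device_id, max(sp.device_hostname) AS device_hostname
  FROM table1 ne INNER JOIN keyed sp
    ON ne.device_id = sp.join_key
  GROUP BY ne.device_id
""").show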