IN clause with WrappedArray in SparkSQL

Time: 2018-06-29 02:29:42

Tags: apache-spark-sql

I have a table with the following structure:

id  codes
 1  WrappedArray(A, B, C)
 2  WrappedArray(A)
 3  WrappedArray(B, D)

I want to return the rows whose array contains any code from a given list, much like a SQL IN clause.

If I try

with my_table as (
  select 1 as id, array('A','B','C') as codes
  union
  select 2 as id, array('A') as codes
  union
  select 3 as id, array('B', 'D') as codes
)
select *
  from my_table t
       lateral view explode(t.codes) as code
 where code in ( 'B', 'D')

I get ID 3 twice, because it contains both the B and D codes.

I could do something like

with my_table as (
  select 1 as id, array('A','B','C') as codes
  union
  select 2 as id, array('A') as codes
  union
  select 3 as id, array('B', 'D') as codes
)
select id from my_table
 where id in (
       select id
         from my_table sub
              lateral view posexplode(sub.codes) as code_pos, code
        where code in ( 'B', 'D') )

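But that requires me to reference my_table twice. In reality my table is large, and I would rather avoid what is essentially a self-join, since each row of the main table already holds the data needed to evaluate the condition.

I would like to do something like this: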

with my_table as (
  select 1 as id, array('A','B','C') as codes
  union
  select 2 as id, array('A') as codes
  union
  select 3 as id, array('B', 'D') as codes
)
select id
  from my_table t
 where exists ( select 1 from (select 0) lateral view explode(t.codes) as code where code in ( 'B', 'D') )

But that throws

  Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses

array_contains looks close to what I need, but it only takes a single value, not a list of values.
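
For a short, fixed list of codes, one possible workaround is to call array_contains once per code and combine the calls with OR. This is only a sketch against the sample data above, not a general answer: it assumes the list is known when the query is written, and that the elements are plain strings (struct elements would need a matching struct value as the second argument).

```sql
-- Sketch: one array_contains call per wanted code, combined with OR.
-- my_table is scanned once and nothing is exploded, so each id
-- appears at most once in the result.
with my_table as (
  select 1 as id, array('A','B','C') as codes
  union
  select 2 as id, array('A') as codes
  union
  select 3 as id, array('B', 'D') as codes
)
select id
  from my_table
 where array_contains(codes, 'B')
    or array_contains(codes, 'D')
```

On the sample data this should return ids 1 and 3, each once. Spark 2.4 and later also provide arrays_overlap, so the predicate could be written as arrays_overlap(codes, array('B', 'D')) on those versions.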

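In practice my data is more complicated than this example (the array elements are named_structs rather than simple strings), but I assume I can adapt any solution to my own situation.

Can this be done in pure SQL without a self-join?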

0 answers:

No answers yet