Question

I have the following schema and want to retrieve the rows where at least one 'af' field inside 'dbsnpAnnots' is less than 0.1.

scala> randVarsDF.printSchema
root
|-- chr: string (nullable = true)
|-- pos: long (nullable = true)
|-- ref: string (nullable = true)
|-- alt: string (nullable = true)
|-- dbsnpAnnots: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- af: double (nullable = true)
|    |    |-- common: boolean (nullable = true)
|    |    |-- rsid: string (nullable = true)

I know how to do this with a UDF and the Dataset API, but I would also like to be able to do it in SQL.

This is what I am doing right now:

select count(*) from RANDVARS where dbsnpAnnots[0].af < 0.1 or dbsnpAnnots[1].af < 0.1 or dbsnpAnnots[2].af < 0.1

This only searches the first 3 elements of the dbsnpAnnots array. I want to search all elements, since there can be more than 3.

I have also tried

select count(*) from RANDVARS where dbsnpAnnots[*].af < 0.1

but that is not a valid Spark SQL query.

Any ideas?

Answer 1

You need to explode that array. Since it is an array of structs, you can use inline:

select count(1) 
from (
  select inline(dbsnpAnnots) from RANDVARS 
) p 
where p.af < 0.1
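
Note that after inline each array element becomes its own row, so the count above is the number of matching elements, not the number of original rows. If what you want is the number of rows with at least one match, the following is a rough sketch of two options, not tested against your data. The first assumes Spark 2.4 or later, where the higher-order function exists is available. The second assumes that (chr, pos, ref, alt) is non-null and uniquely identifies a row, which the question does not state.

```sql
-- Option 1 (assumes Spark 2.4+): keep a row if any element of the array
-- satisfies the predicate. Rows with a null dbsnpAnnots are not counted.
select count(*)
from RANDVARS
where exists(dbsnpAnnots, x -> x.af < 0.1);

-- Option 2 (older versions): explode with inline, then count each original
-- row once. ASSUMPTION: (chr, pos, ref, alt) is a non-null unique key.
select count(distinct chr, pos, ref, alt)
from (
  select chr, pos, ref, alt, inline(dbsnpAnnots) from RANDVARS
) p
where p.af < 0.1;
```

Option 1 avoids the explode entirely, so it should also be cheaper when the arrays are long.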
