I can't work out how to write a query and get it to run on Impala. I wrote the following working query to join two tables:
SELECT *
FROM illuminavariant as vcf, ensembl_genes as ens
WHERE vcf.filter = "PASS"
AND vcf.qual > 100
AND vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop
Now I am trying to write a query that finds all variants WHERE vcf.filter = "PASS" and vcf.qual > 100 that have no match on chromosome and position.
I tried this:
SELECT *
FROM p7dev.illumina_test, p7dev.ensembl_test
WHERE NOT EXISTS(
SELECT *
FROM p7dev.illumina_test as vcf, p7dev.ensembl_test as ens
WHERE vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop
)
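But this doesn't return any results. I think a WITH clause might do the trick, but I'd really appreciate it if someone could help me understand the logic of how this would work. Thanks a lot!
Answer 0 (score: 2)
Since you are looking for variants that are not associated with any gene, it seems odd to form a cross join of variants and genes and then filter rows out of it. If that's really what you want, though, then this should do it: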
SELECT *
FROM illuminavariant as vcf, ensembl_genes as ens
WHERE vcf.filter = "PASS"
AND vcf.qual > 100
AND (
vcf.chromosome != ens.chromosome
OR vcf.position < ens.start
OR vcf.position > ens.stop
)
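This simply negates the condition that relates variant rows to gene rows, so it returns one row for every (variant, gene) pair that does not match.
I suspect that what you really want is something more like this: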
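To see why the negated cross join is probably not what was intended, here is a minimal sketch. It assumes SQLite (Python's built-in sqlite3) as a stand-in for Impala, made-up sample rows, and single-quoted string literals; only the table and column names come from the question. A variant that does fall inside one gene still shows up, because it fails to match some other gene.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Impala (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE illuminavariant (chromosome TEXT, position INTEGER, filter TEXT, qual REAL);
CREATE TABLE ensembl_genes (chromosome TEXT, start INTEGER, stop INTEGER);
-- Hypothetical sample data, not from the original post.
INSERT INTO illuminavariant VALUES
  ('1', 150, 'PASS', 200),   -- inside gene 1:100-200
  ('1', 500, 'PASS', 200),   -- chromosome 1, but outside every gene
  ('2', 150, 'PASS', 200);   -- no gene on chromosome 2
INSERT INTO ensembl_genes VALUES
  ('1', 100, 200),
  ('3', 100, 200);
""")

# The negated join condition from the answer, applied to the cross join.
negated = """
SELECT vcf.chromosome, vcf.position
FROM illuminavariant AS vcf, ensembl_genes AS ens
WHERE vcf.filter = 'PASS'
  AND vcf.qual > 100
  AND (
    vcf.chromosome != ens.chromosome
    OR vcf.position < ens.start
    OR vcf.position > ens.stop
  )
"""
pairs = conn.execute(negated).fetchall()
variants = sorted(set(pairs))

print(len(pairs))   # one row per non-matching (variant, gene) pair
print(variants)     # includes ('1', 150) even though it lies inside a gene
```

With this sample data the query produces 5 rows covering all 3 variants: the variant at 1:150 is paired with the chromosome 3 gene, and the other two variants are each paired with both genes.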
SELECT vcf.*
FROM
illuminavariant as vcf
LEFT JOIN ensembl_genes as ens
ON vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop
WHERE
vcf.filter = "PASS"
AND vcf.qual > 100
AND ens.chromosome IS NULL
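This performs the same join as your first query, but as a left join. A variant with no matching gene still yields one row, with every ens column set to NULL, so the ens.chromosome IS NULL condition filters out the rows that represent actual matches. The query returns only the columns of the variant table, because the whole point is to find variants that have no corresponding row in the gene table.
Answer 1 (score: 1)
Try this...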
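The left-join version can be tried on a toy dataset. This is only a sketch under assumptions: SQLite (via Python's sqlite3) instead of Impala, invented sample rows, and single-quoted literals; the table and column names are taken from the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Impala (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE illuminavariant (chromosome TEXT, position INTEGER, filter TEXT, qual REAL);
CREATE TABLE ensembl_genes (chromosome TEXT, start INTEGER, stop INTEGER);
-- Hypothetical sample data, not from the original post.
INSERT INTO illuminavariant VALUES
  ('1', 150, 'PASS', 200),   -- overlaps two genes, must be excluded
  ('1', 500, 'PASS', 200),   -- no overlapping gene
  ('2', 150, 'PASS', 200),   -- no gene on this chromosome
  ('1', 900, 'PASS', 50),    -- fails the qual filter
  ('1', 950, 'LowQ', 300);   -- fails the PASS filter
INSERT INTO ensembl_genes VALUES
  ('1', 100, 200),
  ('1', 120, 180),
  ('3', 100, 200);
""")

# Anti-join: keep only variants whose left join found no gene row.
anti_join = """
SELECT vcf.chromosome, vcf.position
FROM illuminavariant AS vcf
LEFT JOIN ensembl_genes AS ens
  ON vcf.chromosome = ens.chromosome
 AND vcf.position BETWEEN ens.start AND ens.stop
WHERE vcf.filter = 'PASS'
  AND vcf.qual > 100
  AND ens.chromosome IS NULL
ORDER BY vcf.chromosome, vcf.position
"""
unmatched = conn.execute(anti_join).fetchall()
print(unmatched)  # [('1', 500), ('2', 150)]
```

Note that the variant at 1:150 overlaps two genes but is excluded entirely, and each unmatched variant appears exactly once.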
SELECT *
FROM p7dev.illumina_test vcf
WHERE NOT EXISTS( SELECT 1
FROM p7dev.ensembl_test as ens
WHERE vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop
)
AND vcf.filter = 'PASS'
AND vcf.qual > 100
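The difference from the attempt in the question is that the subquery here refers to the outer vcf row, so it is evaluated once per variant. In the question, the subquery declares its own vcf and ens, so it does not depend on the outer row at all: if any variant overlaps any gene, NOT EXISTS is false for every row and nothing comes back. A minimal sketch of both behaviours, assuming SQLite (Python's sqlite3) in place of Impala, invented sample rows, and the p7dev schema prefix dropped:

```python
import sqlite3

# In-memory SQLite database as a stand-in for Impala (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE illumina_test (chromosome TEXT, position INTEGER, filter TEXT, qual REAL);
CREATE TABLE ensembl_test (chromosome TEXT, start INTEGER, stop INTEGER);
-- Hypothetical sample data, not from the original post.
INSERT INTO illumina_test VALUES
  ('1', 150, 'PASS', 200),   -- inside gene 1:100-200
  ('1', 500, 'PASS', 200),   -- outside every gene
  ('2', 150, 'PASS', 200);   -- no gene on chromosome 2
INSERT INTO ensembl_test VALUES ('1', 100, 200);
""")

# Question's attempt: the subquery is independent of the outer rows.
uncorrelated = """
SELECT *
FROM illumina_test, ensembl_test
WHERE NOT EXISTS (
  SELECT *
  FROM illumina_test AS vcf, ensembl_test AS ens
  WHERE vcf.chromosome = ens.chromosome
    AND vcf.position BETWEEN ens.start AND ens.stop
)
"""

# Answer's version: the subquery references the outer alias vcf.
correlated = """
SELECT vcf.chromosome, vcf.position
FROM illumina_test vcf
WHERE NOT EXISTS (
  SELECT 1
  FROM ensembl_test AS ens
  WHERE vcf.chromosome = ens.chromosome
    AND vcf.position BETWEEN ens.start AND ens.stop
)
  AND vcf.filter = 'PASS'
  AND vcf.qual > 100
ORDER BY vcf.chromosome, vcf.position
"""

empty = conn.execute(uncorrelated).fetchall()
unmatched = conn.execute(correlated).fetchall()
print(empty)      # [] because one overlapping variant exists somewhere
print(unmatched)  # [('1', 500), ('2', 150)]
```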