This is my table:

chr   pos  refalt
-----------------
chr1  123  AA
chr1  123  AA
chr1  123  AA
chr1  123  AA
chr1  123  AA
chr1  123  AC
chr1  123  AC
chr1  123  AC
chr2  456  TC
chr3  789  GC

I need to calculate a specific frequency. Here is an example:

Each row is a patient, so "chr1 123 AA" has 5 patients and "chr1 123 AC" has 3.

I want to know the frequency of A. The calculation is:

13(A)/16, because 13 of the alleles at "chr1 123" are A, out of 16 in total: 5×A (ref) + 5×A (alt) + 3×A (ref) + 3×C (alt).

For C:

3(C)/16, because only 3 of the alleles are C.

How can I do this in SQL? It seems too complicated. refalt is a single column, so I need to split each value to get ref and alt.

I know it's a bit complicated, so just ask me if you need more details.
Answer 0 (score: 0)

For anyone (especially biologists) who wants to know how to achieve this:
-- Unpivot ref (1st character) and alt (2nd character) into one allele column,
-- then count each allele. Every patient contributes two alleles, so the
-- denominator is 2 * number of rows at that position.
select allele,
       count(*)::numeric /
       (select 2*count(*) from ft_variants where pos_chr like 'chr1 12783') as frequency
from (
    select substring(refalt from 1 for 1) as allele   -- ref
    from ft_variants
    where pos_chr like 'chr1 12783'
    union all   -- plain UNION would drop duplicate rows and undercount
    select substring(refalt from 2 for 1) as allele   -- alt
    from ft_variants
    where pos_chr like 'chr1 12783'
) as alleles
group by allele;
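
To check the arithmetic against the example table from the question, here is a minimal sketch that computes the frequencies for every position at once. It is an illustration under assumptions, not the poster's setup: it uses an in-memory SQLite database through Python's sqlite3 module instead of PostgreSQL, a hypothetical table named variants with separate chr and pos columns as shown in the question, SQLite's substr() in place of substring(... from ... for ...), and it assumes refalt always holds exactly two characters.

```python
import sqlite3

# Example rows from the question: one row per patient (chr, pos, refalt).
rows = (
    [("chr1", 123, "AA")] * 5
    + [("chr1", 123, "AC")] * 3
    + [("chr2", 456, "TC"), ("chr3", 789, "GC")]
)

conn = sqlite3.connect(":memory:")
# "variants" is a hypothetical table name chosen for this sketch.
conn.execute("CREATE TABLE variants (chr TEXT, pos INTEGER, refalt TEXT)")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", rows)

QUERY = """
WITH alleles AS (
    -- ref allele: first character of refalt
    SELECT chr, pos, substr(refalt, 1, 1) AS allele FROM variants
    UNION ALL
    -- alt allele: second character of refalt
    SELECT chr, pos, substr(refalt, 2, 1) AS allele FROM variants
)
SELECT a.chr, a.pos, a.allele,
       COUNT(*) AS n,
       COUNT(*) * 1.0 / t.total AS frequency
FROM alleles AS a
JOIN (
    -- two alleles per patient, so the total is twice the row count
    SELECT chr, pos, 2 * COUNT(*) AS total
    FROM variants
    GROUP BY chr, pos
) AS t ON t.chr = a.chr AND t.pos = a.pos
GROUP BY a.chr, a.pos, a.allele, t.total
ORDER BY a.chr, a.pos, a.allele
"""

frequencies = conn.execute(QUERY).fetchall()
for chrom, pos, allele, n, freq in frequencies:
    print(chrom, pos, allele, n, freq)
```

For "chr1 123" this gives A = 13/16 and C = 3/16, matching the hand calculation in the question. The UNION ALL matters here: a plain UNION removes duplicate rows, which would collapse the per-patient alleles before they are counted.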