Computing skewness in Hive?

Time: 2014-12-29 17:59:06

Tags: statistics hive

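This paper defines sample skewness as

s = E[(X - E(X))^3] / [Var(X)]^(3/2)

What is the simplest way to compute this in Hive?

I imagine a two-pass algorithm: one pass to get E(X) and Var(X), and another to compute E[(X - E(X))^3] and roll it up.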
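
The two-pass idea can be sketched outside Hive to make the arithmetic concrete. The Python below is only an illustration under stated assumptions: the function name `skewness_two_pass` is hypothetical, and it uses the population variance (dividing by n), which may or may not be the convention the paper uses.

```python
def skewness_two_pass(xs):
    """Skewness via two passes; population (divide-by-n) moments assumed."""
    # Pass 1: accumulate count, sum, and sum of squares to get E(X) and Var(X).
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    var = total_sq / n - mean * mean

    # Pass 2: third central moment E[(X - E(X))^3], which needs the mean from pass 1.
    m3 = 0.0
    for x in xs:
        m3 += (x - mean) ** 3
    m3 /= n

    return m3 / var ** 1.5


if __name__ == "__main__":
    print(skewness_two_pass([1, 2, 4, 8, 16, 32]))  # about 1.0952
```

In SQL terms, pass 1 corresponds to an aggregate over the table and pass 2 to a second aggregate that has the mean available on every row.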

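2 answers:

Answer 0 (score: 1)

I think you are right to take a two-step approach, especially if you are using Hive exclusively. Here is one way to do it in two steps, or equivalently in one query with a subquery:

  1. Compute E(X) with an OVER() clause so that the rows are not collapsed by aggregation (we still need each x to compute E[X - E(X)] afterwards):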

    select x, avg(x) over () as e_x  
    from table;
    
  2. Using the subquery above, compute Var(x) and E[X - E(X)]; this step aggregates the data and yields the final statistic:

    select pow(avg(x - e_x), 3)/sqrt(pow(variance(x), 3))
    from (select x, avg(x) over () as e_x 
          from table) tb
    ;
    

Answer 1 (score: 0)

The formula above is incorrect, at least for Pearson's skewness.
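
One way to read this objection, assuming it concerns where the cube is applied: the mean of the deviations from the mean is always zero, so cubing that mean gives zero for any data set, whereas Pearson's moment coefficient averages the cubed standardized deviations. A minimal Python sketch of the difference follows; the function names are illustrative and population variance is assumed.

```python
def cube_of_mean_deviation(xs):
    """Cube applied outside the average: avg(x - mean)^3 / sqrt(var^3)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    mean_dev = sum(x - mean for x in xs) / n  # zero up to rounding
    return mean_dev ** 3 / (var ** 3) ** 0.5


def pearson_skewness(xs):
    """Cube applied inside the average: avg(((x - mean) / sd)^3)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in xs) / n


if __name__ == "__main__":
    data = [1, 2, 4, 8, 16, 32]
    print(cube_of_mean_deviation(data))  # essentially zero
    print(pearson_skewness(data))        # about 1.0952
```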

The following works, at least in Impala:

    with d as (select somevar as x from yourtable where what > 2),
         agg as (select avg(x) as m, STDDEV_POP(x) as s, count(*) as n from d),
         sk as (select avg(pow((x - m) / s, 3)) as skew from d, agg)
    select skew, m, s, n from agg, sk;

I tested it with:

    with dual as (select 1.0 as x),
         -- This generates 1, 2, 4, 8, 16, 32
         d as (select 1*x as x from dual
               union select 2*x from dual
               union select 4*x from dual
               union select 8*x from dual
               union select 16*x from dual
               union select 32*x from dual),
         agg as (select avg(x) as m, STDDEV_POP(x) as s, count(*) as n from d),
         sk as (select avg(pow((x - m) / s, 3)) as skew from d, agg)
    select skew, m, s, n from agg, sk;

It gives the same answer as R:

    require(moments)
    skewness(c(1,2,4,8,16,32)) # gives 1.095221

See https://en.wikipedia.org/wiki/Skewness#Pearson.27s_moment_coefficient_of_skewness
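
The query in this answer joins every row against the aggregates, which amounts to a second pass over the data. As a possible alternative that is not part of the answer, the third central moment can be expanded into raw moments, E[(X - m)^3] = E[X^3] - 3 m E[X^2] + 2 m^3, so that a single aggregation over x, x*x, and x*x*x is enough. The Python sketch below only checks that identity on the test data; `skewness_one_pass` is a hypothetical name, and the raw-moment form can lose precision through cancellation when the mean is large relative to the spread.

```python
def skewness_one_pass(xs):
    """Population skewness from raw moments accumulated in a single pass."""
    n = 0
    s1 = s2 = s3 = 0.0
    for x in xs:
        n += 1
        s1 += x
        s2 += x * x
        s3 += x * x * x

    # Raw moments E[X], E[X^2], E[X^3].
    m1, m2, m3 = s1 / n, s2 / n, s3 / n

    var = m2 - m1 * m1
    third_central = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
    return third_central / var ** 1.5


if __name__ == "__main__":
    print(skewness_one_pass([1, 2, 4, 8, 16, 32]))  # about 1.0952
```

In a SQL engine the three sums would map to avg(x), avg(x*x), and avg(x*x*x) in one select, with the final arithmetic done on the single aggregated row.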