按平均值排序,然后使用Hive插入新表

时间:2013-04-28 07:56:14

标签: sql mapreduce hive

我有一个大型数据文件,如下所示:

    1   6
    1   6
    2   7
    3   2
    3   6
    1   7
    1   9
    2   9
    1   5
    3   9
    3   1
    2   8

我希望按第一列对数据进行分组,找到每个第一列值的第二列平均值,然后按第二列平均值对这些分组进行排序。所以输出应该是:

    2   8
    1   6.6
    3   4.5

我的代码现在看起来像这样,并且不起作用:

    CREATE EXTERNAL TABLE as (a STRING, b INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://myfolder/hive';

    CREATE EXTERNAL TABLE output(a STRING, avgb DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://myfolder/hive';

    load data inpath "s3n://myfolder/file.txt" into TABLE as;
    insert overwrite output select a, avg(b) from as group by a order by avg(b) DESC limit 1000;

我应该注意以下内容可以正常工作,但有些东西不适用于订单,并在SQL中插入适合我的步骤:

    select a, avg(b) from as group by a;

当我尝试:

    select a, avg(b) from as group by a order by avg(b);

我得到“FAILED:语义分析错误:行1:66无效的表别名或列引用'b':(可能的列名是:_col0,_col1)。

1 个答案:

答案 0 :(得分:4)

只需在子查询中将其移出:

select a
from (select a, avg(b) as avgb from as group by a) as t
order by avgb;