Question

I would like to port some R code to Hadoop so that it can be used with Impala or Hive through SQL-like queries. My code is based on this question:

R data table: compare row value to group values, with condition

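The point is to find, for each row, the number of rows in subgroup 1 that have the same id and a lower price.

Suppose I have the following data: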

CREATE TABLE project
(
    id int,
    price int, 
    subgroup int
);

INSERT INTO project(id,price,subgroup) 
VALUES
    (1, 10, 1), 
    (1, 10, 1), 
    (1, 12, 1),
    (1, 15, 1),
    (1,  8, 2),
    (1, 11, 2),
    (2,  9, 1),
    (2, 12, 1),
    (2, 14, 2),
    (2, 18, 2);

Now, the following query works in Impala for the rows in subgroup 1:

select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1

But I also need to handle the rows in subgroup 2.

So the output I want is:

id  price   subgroup   cheaper
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   10      1          0 ( because no row is cheaper in id 1 subgroup 1)
1   12      1          2 ( rows 1 and 2 are cheaper)
1   15      1          3
1    8      2          0 (nobody is cheaper in id 1 and subgroup 1)
1   11      2          2
2    9      1          0
2   12      1          1
2   14      2          2
2   18      2          2

Answer 1

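I had basically the same problem. It is as if you need a window function that you can put a where clause into. To work around it, I collected the prices into an array (where subgroup = 1) and joined that back to the table. Then I wrote a UDF that filters the array by a given predicate.

UDF: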

package somepkg;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;

public class FilterArrayUDF extends UDF {
    // Returns the elements of arr that are strictly less than p.
    public ArrayList<Integer> evaluate(ArrayList<Text> arr, int p) {
        ArrayList<Integer> newList = new ArrayList<Integer>();

        for (int i = 0; i < arr.size(); i++) {
            // Array elements arrive as Text, so parse each one back to an int.
            int elem = Integer.parseInt((arr.get(i)).toString());
            if (elem < p)
                newList.add(elem);
        }
        return newList;
    }
}

Then, once you have the filtered array, you can take its size.

Query:

add jar /path/to/jars/hive-udfs.jar;
create temporary function filter_arr as 'somepkg.FilterArrayUDF';

select B.id, price, subgroup, price_arr
  , filter_arr(price_arr, price) cheaper_arr
  , size(filter_arr(price_arr, price)) cheaper
from db.tbl B
join (
  select id, collect_list(price) price_arr
  from db.tbl
  where subgroup = 1
  group by id ) A
on B.id = A.id

Output:

1    10    1    [10,10,12,15]    []               0
1    10    1    [10,10,12,15]    []               0
1    12    1    [10,10,12,15]    [10,10]          2
1    15    1    [10,10,12,15]    [10,10,12]       3
1    8     2    [10,10,12,15]    []               0
1    11    2    [10,10,12,15]    [10,10]          2
2    9     1    [9,12]           []               0
2    12    1    [9,12]           [9]              1
2    14    2    [9,12]           [9,12]           2
2    18    2    [9,12]           [9,12]           2
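
The logic of the UDF and query above can be checked outside Hive. The following is only a minimal plain-Java sketch, not part of the UDF: the class name CheaperCountSketch and its methods are hypothetical, rows are modelled as {id, price, subgroup} int arrays, and ids with no subgroup-1 rows get a count of 0 here, whereas the inner join in the query would drop them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Hive query; each row is {id, price, subgroup}.
class CheaperCountSketch {

    // Mirrors: select id, collect_list(price) ... where subgroup = 1 group by id
    static Map<Integer, List<Integer>> collectSubgroup1Prices(int[][] rows) {
        Map<Integer, List<Integer>> byId = new HashMap<>();
        for (int[] r : rows) {
            if (r[2] == 1) {
                byId.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
            }
        }
        return byId;
    }

    // Mirrors: size(filter_arr(price_arr, price))
    static int countCheaper(List<Integer> priceArr, int price) {
        int count = 0;
        for (int p : priceArr) {
            if (p < price) {
                count++;
            }
        }
        return count;
    }

    // Mirrors the join on id: one count per input row, in input order.
    // Assumption: an id without subgroup-1 rows yields 0 instead of being dropped.
    static int[] cheaperPerRow(int[][] rows) {
        Map<Integer, List<Integer>> byId = collectSubgroup1Prices(rows);
        int[] cheaper = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            List<Integer> priceArr = byId.getOrDefault(rows[i][0], new ArrayList<>());
            cheaper[i] = countCheaper(priceArr, rows[i][1]);
        }
        return cheaper;
    }

    public static void main(String[] args) {
        int[][] project = {
            {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
            {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
        };
        // prints [0, 0, 2, 3, 0, 2, 0, 1, 2, 2]
        System.out.println(Arrays.toString(cheaperPerRow(project)));
    }
}
```

For the sample data from the question, the counts match the last column of the output above.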

Answer 2

We can try the following query:

select * from 
    (
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 1 union all
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 2) result
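
For reference, rank() - 1 over a partition ordered by price equals the number of rows in that same partition (after the where filter) with a strictly lower price, because tied rows share the rank of the first of them. A minimal plain-Java sketch of that equivalence, with the hypothetical names RankMinusOneSketch and rankMinusOne, assuming one partition is given as a list of prices:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of rank() - 1 within a single partition.
class RankMinusOneSketch {

    // For each price, compute its SQL-style rank (ties share the lowest
    // position, gaps follow) over the prices sorted ascending, minus one.
    static List<Integer> rankMinusOne(List<Integer> prices) {
        List<Integer> sorted = new ArrayList<>(prices);
        Collections.sort(sorted);
        List<Integer> result = new ArrayList<>();
        for (Integer p : prices) {
            // The first occurrence in sorted order determines the rank of all ties.
            int rank = sorted.indexOf(p) + 1;
            result.add(rank - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prices of id 1, subgroup 1 from the question.
        // prints [0, 0, 2, 3]
        System.out.println(rankMinusOne(Arrays.asList(10, 10, 12, 15)));
    }
}
```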

Hadoop query: compare row value to group values, with condition
