I want to port some R code to Hadoop so that I can run it as SQL-like queries in Impala or Hive. My code is based on this question:
R data table: compare row value to group values, with condition
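The point is to find, for each row, the number of rows in subgroup 1 that have the same id and a cheaper price.
Suppose I have the following data: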
CREATE TABLE project
(
    id int,
    price int,
    subgroup int
);
INSERT INTO project(id,price,subgroup)
VALUES
(1, 10, 1),
(1, 10, 1),
(1, 12, 1),
(1, 15, 1),
(1, 8, 2),
(1, 11, 2),
(2, 9, 1),
(2, 12, 1),
(2, 14, 2),
(2, 18, 2);
Now, the following query works in Impala for the rows in subgroup 1:
select *, rank() over (partition by id order by price asc) - 1 as cheaper
from project
where subgroup = 1
But I also need to handle the rows in subgroup 2.
So the output I want is:
id  price  subgroup  cheaper
1   10     1         0   (because no row is cheaper in id 1, subgroup 1)
1   10     1         0   (because no row is cheaper in id 1, subgroup 1)
1   12     1         2   (rows 1 and 2 are cheaper)
1   15     1         3
1   8      2         0   (nobody is cheaper in id 1, subgroup 1)
1   11     2         2
2   9      1         0
2   12     1         1
2   14     2         2
2   18     2         2
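To pin down exactly what `cheaper` should mean, here is a minimal plain-Java sketch (not Impala or Hive code) that brute-forces the count for the sample rows. The class and method names are made up for illustration, and it assumes "cheaper" means a strictly lower price, which is what the expected output above implies.

```java
class CheaperReference {
    // Sample rows from the question, stored as {id, price, subgroup}.
    static final int[][] SAMPLE = {
        {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
        {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
    };

    // For each row, count the subgroup-1 rows that share its id
    // and have a strictly lower price.
    static int[] cheaperCounts(int[][] rows) {
        int[] result = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            for (int[] other : rows) {
                boolean sameId = other[0] == rows[i][0];
                boolean inSubgroup1 = other[2] == 1;
                boolean cheaper = other[1] < rows[i][1];
                if (sameId && inSubgroup1 && cheaper) {
                    result[i]++;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] counts = cheaperCounts(SAMPLE);
        for (int i = 0; i < SAMPLE.length; i++) {
            System.out.println(SAMPLE[i][0] + " " + SAMPLE[i][1] + " "
                    + SAMPLE[i][2] + " " + counts[i]);
        }
    }
}
```

For the sample data the counts come out as 0, 0, 2, 3, 0, 2, 0, 1, 2, 2, which matches the table above, so any SQL solution can be checked against it.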
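Answer 0 (score: 1)
I ran into basically the same problem. It is as if you need a window function that you can put a where clause inside. To work around this, I collected the prices into an array (where subgroup = 1) and joined it back to the table. Then I wrote a UDF that filters the array by a given predicate.
UDF: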
package somepkg;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.util.ArrayList;

public class FilterArrayUDF extends UDF {
    // Returns the elements of arr that are strictly less than p.
    public ArrayList<Integer> evaluate(ArrayList<Text> arr, int p) {
        ArrayList<Integer> newList = new ArrayList<Integer>();
        for (int i = 0; i < arr.size(); i++) {
            // Elements arrive as Text, so parse each one back to an int.
            int elem = Integer.parseInt(arr.get(i).toString());
            if (elem < p) {
                newList.add(elem);
            }
        }
        return newList;
    }
}
Then, once you have the filtered array, you can take its size.
Query:
add jar /path/to/jars/hive-udfs.jar;
create temporary function filter_arr as 'somepkg.FilterArrayUDF';
select B.id, price, subgroup, price_arr
     , filter_arr(price_arr, price) cheaper_arr
     , size(filter_arr(price_arr, price)) cheaper
from db.tbl B
join (
    select id, collect_list(price) price_arr
    from db.tbl
    where subgroup = 1
    group by id
) A
on B.id = A.id
Output:
1 10 1 [10,10,12,15] [] 0
1 10 1 [10,10,12,15] [] 0
1 12 1 [10,10,12,15] [10,10] 2
1 15 1 [10,10,12,15] [10,10,12] 3
1 8 2 [10,10,12,15] [] 0
1 11 2 [10,10,12,15] [10,10] 2
2 9 1 [9,12] [] 0
2 12 1 [9,12] [9] 1
2 14 2 [9,12] [9,12] 2
2 18 2 [9,12] [9,12] 2
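If you want to sanity-check the collect-then-filter idea without a cluster, the following is a rough plain-Java sketch of the same steps. It is not Hive code: `collectSubgroup1Prices` and `filterBelow` are hypothetical stand-ins for the `collect_list` subquery and the `filter_arr` UDF, and the sample rows are the ones from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CollectAndFilterSketch {
    // Sample rows from the question, stored as {id, price, subgroup}.
    static final int[][] SAMPLE = {
        {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
        {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
    };

    // Stand-in for: select id, collect_list(price) ... where subgroup = 1 group by id
    static Map<Integer, List<Integer>> collectSubgroup1Prices(int[][] rows) {
        Map<Integer, List<Integer>> pricesById = new HashMap<>();
        for (int[] r : rows) {
            if (r[2] == 1) {
                pricesById.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
            }
        }
        return pricesById;
    }

    // Stand-in for filter_arr(price_arr, price): keep prices strictly below the threshold.
    static List<Integer> filterBelow(List<Integer> prices, int threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int p : prices) {
            if (p < threshold) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> pricesById = collectSubgroup1Prices(SAMPLE);
        for (int[] r : SAMPLE) {
            List<Integer> priceArr = pricesById.get(r[0]);
            if (priceArr == null) {
                continue; // inner join: ids without subgroup-1 rows are dropped
            }
            List<Integer> cheaperArr = filterBelow(priceArr, r[1]);
            System.out.println(r[0] + " " + r[1] + " " + r[2] + " "
                    + priceArr + " " + cheaperArr + " " + cheaperArr.size());
        }
    }
}
```

One design point the sketch makes visible: because the query uses an inner join, any id that has no subgroup 1 rows disappears from the result instead of showing a count of 0. A left join would be one way to keep those rows, if that matters for your data.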
Answer 1 (score: 0)
We can try the following query, which ranks the rows of each subgroup separately:
select * from
(
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 1
    -- union all keeps the duplicate (1, 10, 1) rows; a plain union would merge them
    union all
    select *, rank() over (partition by id order by price asc) - 1 as cheaper
    from project
    where subgroup = 2
) result
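It may be worth checking what this produces for subgroup 2 before relying on it. As a rough plain-Java sketch (hypothetical class and method names, not Impala code): `rank() - 1` inside a partition equals the number of rows in that partition with a strictly lower price, so ranking each subgroup on its own compares subgroup 2 rows with each other, not with subgroup 1.

```java
class RankPerSubgroupSketch {
    // Sample rows from the question, stored as {id, price, subgroup}.
    static final int[][] SAMPLE = {
        {1, 10, 1}, {1, 10, 1}, {1, 12, 1}, {1, 15, 1}, {1, 8, 2},
        {1, 11, 2}, {2, 9, 1}, {2, 12, 1}, {2, 14, 2}, {2, 18, 2}
    };

    // Emulates rank() over (partition by id order by price) - 1 when the
    // rows are first restricted to a single subgroup: count the rows with
    // the same id AND the same subgroup that have a strictly lower price.
    static int[] rankMinusOneWithinSubgroup(int[][] rows) {
        int[] result = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            for (int[] other : rows) {
                boolean samePartition = other[0] == rows[i][0] && other[2] == rows[i][2];
                if (samePartition && other[1] < rows[i][1]) {
                    result[i]++;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] ranks = rankMinusOneWithinSubgroup(SAMPLE);
        for (int i = 0; i < SAMPLE.length; i++) {
            System.out.println(SAMPLE[i][0] + " " + SAMPLE[i][1] + " "
                    + SAMPLE[i][2] + " " + ranks[i]);
        }
    }
}
```

On the sample data the subgroup 1 rows match the expected output, but the four subgroup 2 rows come out as 0, 1, 0, 1 rather than the 0, 2, 2, 2 asked for in the question.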