Hive collect_set()

Time: 2017-03-27 19:21:32

Tags: sql hive

Suppose I have two tables: timeperiod1 and timeperiod2.

timeperiod1 has a schema like this:

cluster  characteristic
A        1
A        2
A        3
B        2
B        3

timeperiod2 has a similar schema:

cluster  characteristic
A        1
A        2
B        2
B        3
B        4

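I want to compute, per cluster, the set difference between the two time periods (i.e., the two tables). My plan (please tell me if there is a better way) is to 1) run collect_set (which I know how to do) and then 2) compute the set difference between the collected sets (which I don't know how to do).

1) I do this: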

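-- Alias the aggregate so the new table's column is named characteristic
-- (without an alias, Hive assigns a generated name such as _c1).
CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod2
GROUP BY cluster;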

This yields collect_char_wk1:

cluster  characteristic
A        [1,2,3]
B        [2,3]

and collect_char_wk2:

cluster  characteristic
A        [1,2]
B        [2,3,4]

2) Is there a Hive function that computes a set difference? I am not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table set_diff_wk1_to_wk2:

cluster  set_diff
A        1
B        0

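The above is a toy example. My real data has tens of billions of rows and a few columns, so I need a computationally efficient solution. My data is stored in HDFS and I am using HiveQL + Python.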

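4 Answers:

Answer 0 (score: 1)

If you are trying to get, for each cluster, the number of characteristics that appear in period 1 but not in period 2, you can simply use a left join with a group by: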

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic
group by t1.cluster
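
To see why the left join produces the requested counts, here is a minimal Python sketch that mimics the query on the toy rows from the question. The function name `left_join_set_diff` is made up for illustration; it is not a Hive or library function, and the sketch only models the join logic, not Hive's execution.

```python
from collections import defaultdict

# Toy rows from the question: (cluster, characteristic)
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def left_join_set_diff(t1_rows, t2_rows):
    """Count, per cluster, the t1 rows with no matching (cluster, characteristic) in t2.

    Hypothetical helper that imitates the LEFT JOIN + conditional COUNT.
    """
    t2_keys = set(t2_rows)  # join keys available on the right-hand side
    counts = defaultdict(int)
    for cluster, characteristic in t1_rows:
        counts[cluster] += 0  # every t1 cluster shows up, even with a count of 0
        if (cluster, characteristic) not in t2_keys:
            counts[cluster] += 1  # unmatched row: t2.characteristic would be NULL
    return dict(counts)


result = left_join_set_diff(timeperiod1, timeperiod2)
print(result)  # {'A': 1, 'B': 0}
```

Like the query, this counts rows rather than distinct values, so it assumes timeperiod1 holds each (cluster, characteristic) pair at most once. If duplicates are possible, deduplicating timeperiod1 first would be needed to get a true set difference.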

Answer 1 (score: 1)

select      cluster

           ,count(*)                                          as count_total_characteristic 
           ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
           ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
           ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
           ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from       (select      cluster
                       ,characteristic
                       ,max(case when tab = 1 then 1 else 0 end) as in_1
                       ,max(case when tab = 2 then 1 else 0 end) as in_2

            from        (           select 1 as tab,cluster,characteristic from timeperiod1
                        union all   select 2 as tab,cluster,characteristic from timeperiod2
                        ) t

            group by    cluster
                       ,characteristic
            ) t

group by    cluster

order by    cluster
;
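
The query above tags each row with its source table, collapses the tags into one pair of presence flags per (cluster, characteristic), and then buckets the characteristics per cluster. A rough Python sketch of the same idea on the question's toy rows follows; `compare_periods` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

# Toy rows from the question: (cluster, characteristic)
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def compare_periods(t1_rows, t2_rows):
    """Bucket each cluster's characteristics by which period(s) contain them."""
    # Inner query: UNION ALL with a source tag, then one [in_1, in_2]
    # flag pair per (cluster, characteristic).
    flags = defaultdict(lambda: [0, 0])
    for tab, rows in ((1, t1_rows), (2, t2_rows)):
        for key in rows:
            flags[key][tab - 1] = 1

    # Outer query: group by cluster and sort each bucket.
    summary = defaultdict(
        lambda: {"both_1_and_2": [], "only_in_1": [], "only_in_2": []}
    )
    for (cluster, characteristic), (in_1, in_2) in flags.items():
        if in_1 and in_2:
            bucket = "both_1_and_2"
        elif in_1:
            bucket = "only_in_1"
        else:
            bucket = "only_in_2"
        summary[cluster][bucket].append(characteristic)

    return {
        cluster: {name: sorted(values) for name, values in buckets.items()}
        for cluster, buckets in sorted(summary.items())
    }


comparison = compare_periods(timeperiod1, timeperiod2)
for cluster, buckets in comparison.items():
    print(cluster, buckets)
```

On the toy data, cluster A ends up with 3 only in period 1, and cluster B with 4 only in period 2. The length of the only_in_1 list corresponds to the set_diff value the question asks for.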

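Answer 2 (score: 1)

You can use the Brickhouse UDFs, which include many functions that do the kind of thing you describe. More specifically, you can use set_diff, which is explained in the wiki. The README file will guide you through building the jar file.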

You can include the jar file in your query, for example with Hive's ADD JAR statement, and then use this guide to access the functions: https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql

Hope this helps.

Answer 3 (score: 0)

SELECT 
 t1.cluster t1_cluster, t2.cluster t2_cluster,
 COLLECT_SET(t1.characteristic) as t1_set, 
 COLLECT_SET(t2.characteristic) as t2_set,
 (SIZE(COLLECT_SET(t1.characteristic)) - 
  SIZE(COLLECT_SET(t2.characteristic))) 
 as set_diff
FROM timeperiod1 t1
INNER JOIN timeperiod2 t2 ON (t1.cluster=t2.cluster)
GROUP BY t1.cluster, t2.cluster;

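This gives the difference between the sizes of the two sets. However, you will need a Python function to return the actual missing values from the sets. Hope this helps.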
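
A minimal sketch of such a Python function is shown below, assuming the collected arrays have already been fetched into Python as lists keyed by cluster. The dictionaries `wk1` and `wk2` and the name `missing_values` are illustrative assumptions, not part of Hive or any library.

```python
# Collected arrays as they might come back from collect_char_wk1 / collect_char_wk2
# (assumed layout: cluster -> list of characteristics).
wk1 = {"A": [1, 2, 3], "B": [2, 3]}
wk2 = {"A": [1, 2], "B": [2, 3, 4]}


def missing_values(first, second):
    """Return the sorted values present in `first` but absent from `second`."""
    return sorted(set(first) - set(second))


for cluster in sorted(wk1):
    a = wk1[cluster]
    b = wk2.get(cluster, [])  # a cluster missing from wk2 is treated as an empty set
    size_gap = len(set(a)) - len(set(b))  # what the SIZE(...) - SIZE(...) expression computes
    print(cluster, missing_values(a, b), size_gap)
# A [3] 1
# B [] -1
```

Note that for cluster B the size gap is -1 while the set difference is empty, so subtracting sizes only matches the set-difference count when the second set is a subset of the first.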