Question

Suppose I have two tables: timeperiod1 and timeperiod2.

timeperiod1 has a schema like this:

cluster  characteristic
A        1
A        2
A        3
B        2
B        3

timeperiod2 has a similar schema:

cluster  characteristic
A        1
A        2
B        2
B        3
B        4

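I want to compute the set difference between the two time periods (i.e., tables) per cluster. My plan (please tell me if there is a better way) is to 1) COLLECT_SET (which I know how to do) and then 2) compute the set difference (which I don't know how to do).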

1) This is what I did:

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic) AS characteristic
FROM timeperiod2
GROUP BY cluster;

This gives collect_char_wk1:

cluster  characteristic
A        [1,2,3]
B        [2,3]

and collect_char_wk2:

cluster  characteristic
A        [1,2]
B        [2,3,4]

2) Is there a Hive function that can compute the set difference? I am not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.

The result should be a table set_diff_wk1_to_wk2:

cluster  set_diff
A        1
B        0

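The above is a toy example. My actual data is tens of billions of rows with several columns, so I need a computationally efficient solution. My data is stored in HDFS, and I am using HiveQL + Python.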

Answer 1

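If you are trying to get, for each cluster, the number of characteristics in period 1 that are not in period 2, you can simply use a left join and group by.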

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic
group by t1.cluster
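
As a quick sanity check, the query above can be run against the question's toy data. The sketch below uses Python's built-in sqlite3 module rather than Hive; this is an assumption made for convenience, and it only works here because the query uses plain standard SQL. The function name run_toy_example and the added ORDER BY are illustrative, not part of the original answer.

```python
import sqlite3

# Toy data copied from the question.
TIMEPERIOD1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
TIMEPERIOD2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]

# The answer's query, with an ORDER BY added so the row order is deterministic.
QUERY = """
select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2
  on t1.cluster = t2.cluster and t1.characteristic = t2.characteristic
group by t1.cluster
order by t1.cluster
"""


def run_toy_example():
    """Load the toy tables into an in-memory SQLite database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        for name, rows in (("timeperiod1", TIMEPERIOD1), ("timeperiod2", TIMEPERIOD2)):
            conn.execute(f"create table {name} (cluster text, characteristic integer)")
            conn.executemany(f"insert into {name} values (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_toy_example())  # [('A', 1), ('B', 0)]
```

The result matches the set_diff_wk1_to_wk2 table requested in the question. Note that if timeperiod1 can contain duplicate (cluster, characteristic) rows, each unmatched duplicate is counted separately, so deduplicating first may be needed.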

Answer 2

select      cluster

           ,count(*)                                          as count_total_characteristic 
           ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
           ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
           ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
           ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from       (select      cluster
                       ,characteristic
                       ,max(case when tab = 1 then 1 else 0 end) as in_1
                       ,max(case when tab = 2 then 1 else 0 end) as in_2

            from        (           select 1 as tab,cluster,characteristic from timeperiod1
                        union all   select 2 as tab,cluster,characteristic from timeperiod2
                        ) t

            group by    cluster
                       ,characteristic
            ) t

group by    cluster

order by    cluster
;

Answer 3

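You can use the Brickhouse UDFs, which include many functions that do what you describe. More specifically, you can use set_diff, which is explained in the wiki. The README will guide you through building the jar file.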

You can include the jar file in your query with Hive's ADD JAR statement (the jar path depends on where you built or uploaded it).

Then use this guide to access the functions: https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql

Hope this helps.

Answer 4

SELECT 
 t1.cluster t1_cluster, t2.cluster t2_cluster,
 COLLECT_SET(t1.characteristic) as t1_set, 
 COLLECT_SET(t2.characteristic) as t2_set,
 (SIZE(COLLECT_SET(t1.characteristic)) - 
  SIZE(COLLECT_SET(t2.characteristic))) 
 as set_diff
FROM timeperiod1 t1
INNER JOIN timeperiod2 t2 ON (t1.cluster=t2.cluster)
GROUP BY t1.cluster, t2.cluster;

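This gives the difference between the sizes of the two sets per cluster (which is not the same as the number of elements in the set difference). However, you will need a Python function to return the actual missing values from the sets. Hope this helps.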
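
The Python function mentioned above could look roughly like the following minimal sketch. It assumes the collected arrays have already been pulled into Python as dictionaries keyed by cluster; the function name set_diff_by_cluster and that dictionary layout are hypothetical, chosen only for illustration. It is suitable for inspecting small results, not for the tens of billions of rows mentioned in the question.

```python
def set_diff_by_cluster(wk1, wk2):
    """Return {cluster: sorted values present in wk1[cluster] but not in wk2[cluster]}."""
    return {
        cluster: sorted(set(values) - set(wk2.get(cluster, ())))
        for cluster, values in wk1.items()
    }


# Collected arrays from the question (hypothetical in-memory representation).
collect_char_wk1 = {"A": [1, 2, 3], "B": [2, 3]}
collect_char_wk2 = {"A": [1, 2], "B": [2, 3, 4]}

missing = set_diff_by_cluster(collect_char_wk1, collect_char_wk2)
print(missing)  # {'A': [3], 'B': []}
print({c: len(v) for c, v in missing.items()})  # {'A': 1, 'B': 0}
```

For cluster B, subtracting the set sizes as in the query gives 2 - 3 = -1, whereas the actual set difference is empty, so the two quantities should not be confused.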
