Suppose I have two tables: timeperiod1 and timeperiod2.
timeperiod1 has a schema like this:
cluster characteristic
A 1
A 2
A 3
B 2
B 3
timeperiod2 has a similar schema:
cluster characteristic
A 1
A 2
B 2
B 3
B 4
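I want to compute the set difference between the two time periods (i.e. tables), per cluster. My plan (please tell me if there is a better way) is to 1) COLLECT_SET each table (which I know how to do) and then 2) compute the set difference (which I don't know how to do).
1) I do this: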
CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod1
GROUP BY cluster;
CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod2
GROUP BY cluster;
This gives collect_char_wk1:
cluster characteristic
A [1,2,3]
B [2,3]
and collect_char_wk2:
cluster characteristic
A [1,2]
B [2,3,4]
2) Is there a Hive function that can compute the set difference? I am not familiar enough with Java to write my own set_diff() Hive UDF/UDAF.
The result should be a table set_diff_wk1_to_wk2:
cluster set_diff
A 1
B 0
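The above is a toy example; my actual data is tens of billions of rows with a few columns, so I need a computationally efficient solution. My data is stored in HDFS and I am using HiveQL + Python.
Answer 0 (score: 1)
If you are trying to get, for each cluster, the number of characteristics in period 1 that are not in period 2, you can simply use a left join and a group by.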
select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic
group by t1.cluster
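To sanity-check the logic on the toy data from the question, here is a minimal Python sketch of the same idea: for each cluster, count the characteristics of period 1 that have no match in period 2. The function name count_only_in_first is made up for illustration. Note one assumption: the sketch deduplicates its input, whereas the query counts rows, so the two agree only when timeperiod1 contains no duplicate (cluster, characteristic) pairs.

```python
from collections import defaultdict

# Toy rows from the question: (cluster, characteristic)
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def count_only_in_first(rows1, rows2):
    """Per cluster, count distinct characteristics in rows1 that are absent from rows2."""
    seen_in_2 = set(rows2)
    counts = defaultdict(int)
    for cluster, characteristic in set(rows1):
        # Touch the key so a cluster with no missing values still appears with 0,
        # just as the LEFT JOIN keeps every cluster of timeperiod1.
        counts[cluster] += 0
        if (cluster, characteristic) not in seen_in_2:
            counts[cluster] += 1
    return dict(counts)


set_diff = count_only_in_first(timeperiod1, timeperiod2)
print(sorted(set_diff.items()))  # [('A', 1), ('B', 0)]
```

This is only meant for checking small samples; at the scale described in the question the join should stay in Hive.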
Answer 1 (score: 1)
select cluster
,count(*) as count_total_characteristic
,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2
,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2
from (select cluster
,characteristic
,max(case when tab = 1 then 1 else 0 end) as in_1
,max(case when tab = 2 then 1 else 0 end) as in_2
from ( select 1 as tab,cluster,characteristic from timeperiod1
union all select 2 as tab,cluster,characteristic from timeperiod2
) t
group by cluster
,characteristic
) t
group by cluster
order by cluster
;
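The query above works in two steps: it first tags each row with its source table and reduces every (cluster, characteristic) pair to two membership flags, then it aggregates those flags per cluster. A minimal Python sketch of the same two steps on the toy data may make that easier to follow. The function name compare_periods is hypothetical, and only the three list columns are reproduced, not the counts.

```python
from collections import defaultdict

# Toy rows from the question: (cluster, characteristic)
timeperiod1 = [("A", 1), ("A", 2), ("A", 3), ("B", 2), ("B", 3)]
timeperiod2 = [("A", 1), ("A", 2), ("B", 2), ("B", 3), ("B", 4)]


def compare_periods(rows1, rows2):
    """Mimic the UNION ALL + two-level GROUP BY on in-memory rows."""
    # Step 1: one [in_1, in_2] flag pair per (cluster, characteristic),
    # like the inner GROUP BY with max(case when tab = ... end).
    flags = defaultdict(lambda: [0, 0])
    for tab, rows in enumerate((rows1, rows2)):
        for key in rows:
            flags[key][tab] = 1

    # Step 2: group by cluster and route each characteristic to one column.
    result = defaultdict(
        lambda: {"both_1_and_2": [], "only_in_1": [], "only_in_2": []}
    )
    for (cluster, characteristic), (in_1, in_2) in flags.items():
        if in_1 and in_2:
            column = "both_1_and_2"
        elif in_1:
            column = "only_in_1"
        else:
            column = "only_in_2"
        result[cluster][column].append(characteristic)

    # Sort lists and clusters, like sort_array(...) and ORDER BY cluster.
    return {
        cluster: {name: sorted(values) for name, values in columns.items()}
        for cluster, columns in sorted(result.items())
    }


summary = compare_periods(timeperiod1, timeperiod2)
for cluster, columns in summary.items():
    print(cluster, columns)
```

The length of each list corresponds to the matching count column in the query, for example len(summary["A"]["only_in_1"]) for count_only_in_1.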
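Answer 2 (score: 1)
You can use the Brickhouse UDFs, which include a number of functions that do what you describe. More specifically, you can use set_diff, which is explained in the wiki. The README file will guide you through building the jar file.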
You can include the jar file in your query with Hive's ADD JAR statement (the path below is a placeholder; it depends on where you built or uploaded the jar):
ADD JAR /path/to/brickhouse.jar;
Then follow this file to make the functions available: https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql
Hope this helps.
Answer 3 (score: 0)
SELECT
t1.cluster t1_cluster, t2.cluster t2_cluster,
COLLECT_SET(t1.characteristic) as t1_set,
COLLECT_SET(t2.characteristic) as t2_set,
(SIZE(COLLECT_SET(t1.characteristic)) -
SIZE(COLLECT_SET(t2.characteristic)))
as set_diff
FROM timeperiod1 t1
INNER JOIN timeperiod2 t2 ON (t1.cluster=t2.cluster)
GROUP BY t1.cluster, t2.cluster;
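This gives the difference between the sizes of the two sets; however, you will need a Python function to return the actual missing values from the sets. Hope this helps.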
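The Python function mentioned above could look something like the following minimal sketch. It assumes the collected sets have already been loaded into Python as lists keyed by cluster; the name missing_values and the two dictionaries are illustrative, with values copied from the collect_char_wk1 and collect_char_wk2 tables in the question.

```python
def missing_values(first, second):
    """Return the sorted values present in `first` but absent from `second`."""
    return sorted(set(first) - set(second))


# Collected sets per cluster, as shown in the question.
collect_char_wk1 = {"A": [1, 2, 3], "B": [2, 3]}
collect_char_wk2 = {"A": [1, 2], "B": [2, 3, 4]}

# Week 1 minus week 2; a cluster missing from week 2 is treated as an empty set.
set_diff_wk1_to_wk2 = {
    cluster: missing_values(values, collect_char_wk2.get(cluster, []))
    for cluster, values in collect_char_wk1.items()
}
print(set_diff_wk1_to_wk2)  # {'A': [3], 'B': []}
```

Taking len() of each list gives the per-cluster counts the question asks for (1 for A, 0 for B).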