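I am trying to create a SQL or Pig query that produces, for each type, the number of occurrences of each distinct value.
In other words, given this table: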
Type: Value:
A x
B y
C y
B y
C z
A x
A z
A z
A x
B x
B z
B x
C x
I would like to get the following result:
Type: x: y: z:
A 3 0 2
B 2 2 1
C 1 1 1
In addition, a table of averages (each count divided by the total for its type) would also be useful:
Type: x: y: z:
A 0.60 0.00 0.40
B 0.40 0.40 0.20
C 0.33 0.33 0.33
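To make the "average" table concrete: each cell is the count divided by the total of its row. A minimal Python sketch of that arithmetic, using the counts from the desired result above (the function name `row_proportions` is only illustrative):

```python
# Counts copied from the desired result table in the question (columns x, y, z).
COUNTS = {"A": [3, 0, 2], "B": [2, 2, 1], "C": [1, 1, 1]}


def row_proportions(counts):
    """Divide every count by the total of its row; an empty row stays at 0.0."""
    result = {}
    for key, row in counts.items():
        total = sum(row)
        result[key] = [c / total if total else 0.0 for c in row]
    return result


for key, row in row_proportions(COUNTS).items():
    # Two decimals, matching the formatting used in the question.
    print(key, *(f"{p:.2f}" for p in row))
```

For type A this is 3/5, 0/5 and 2/5, which is where 0.60, 0.00 and 0.40 come from.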
Edit 4
I'm a Pig novice, but after reading 8 different Stack Overflow posts I came up with this.
When I use this Pig query
A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;
Kind of works.......
I get:
(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)
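I can import this into a text file, strip out the "(" and ")", and use it as a CSV with the first row as the schema. This kind of works, but it is SO SLOW. I would like a better, faster, cleaner way. If anyone knows of one, please let me know.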
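For reference, the post-processing step described above (strip the parentheses, treat the first row as the header) could look roughly like this in Python. It is only a sketch: `dump_to_csv` is a made-up helper name, and it assumes no field contains a comma or a parenthesis of its own.

```python
import csv
import io


def dump_to_csv(lines):
    """Convert Pig DUMP tuples such as "(A,3,0,2)" into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Drop only the outer parentheses of each tuple.
        if line.startswith("(") and line.endswith(")"):
            line = line[1:-1]
        writer.writerow(line.split(","))
    return out.getvalue()


# The four tuples shown in the question; the first one acts as the header row.
sample = ["(0,x,y,z)", "(A,3,0,2)", "(B,2,2,1)", "(C,1,1,1)"]
print(dump_to_csv(sample), end="")
```

Writing the relation with STORE ... USING PigStorage(',') instead of DUMP produces delimiter-separated fields without the parentheses, which may make this step unnecessary.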
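Answer 0: (score: 0)
The best approach I can think of works only on Oracle, and although it does not give you one column per value, it presents the data like this: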
A x=3,y=3,z=3
B x=4,y=3
C y=3,z=2
Of course, if you have 900 values it will show:
A x=3,y=6,...,ff=12
etc...
I can't add comments, so I can't ask whether Oracle is acceptable for you. In any case, here is the query that achieves this:
-- Aliases are value_counts and max_seq because VALUES is a reserved word in Oracle.
SELECT type, value_counts FROM
(SELECT type, SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) value_counts, seq,
MAX(seq) OVER (partition by type) max_seq
FROM
(SELECT type, value, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(SELECT type, value, COUNT(*) OCC
FROM tableName
GROUP BY type, value))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max_seq;
For the averages, you need to compute the total per type before everything else. Here is the code:
-- Aliases are value_counts and max_seq because VALUES is a reserved word in Oracle.
SELECT * FROM
(SELECT type,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) value_counts,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','),2) average,
seq, MAX(seq) OVER (partition by type) max_seq
FROM
(SELECT type, value, TOT, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(
SELECT type, value, TOT, COUNT(*) OCC
FROM (SELECT type, value, COUNT(*) OVER (partition by type) TOT
FROM tableName)
GROUP BY type, value, TOT
))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max_seq;
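To show what the two queries are meant to return without needing an Oracle instance, here is a rough Python equivalent run against the sample table from the question. This is an illustration of the intended output, not a translation of the CONNECT BY mechanics; `summarize` is a made-up name, and the two-decimal rounding is my own choice (the SQL above does not round).

```python
from collections import Counter

# Sample rows copied from the table in the question.
ROWS = [
    ("A", "x"), ("B", "y"), ("C", "y"), ("B", "y"), ("C", "z"),
    ("A", "x"), ("A", "z"), ("A", "z"), ("A", "x"), ("B", "x"),
    ("B", "z"), ("B", "x"), ("C", "x"),
]


def summarize(rows):
    """Return {type: (counts string, ratios string)} with values in name order."""
    occ = Counter(rows)                  # like COUNT(*) ... GROUP BY type, value
    tot = Counter(t for t, _ in rows)    # like COUNT(*) OVER (PARTITION BY type)
    result = {}
    for (t, v), n in sorted(occ.items()):
        counts, ratios = result.get(t, ("", ""))
        sep = "," if counts else ""
        result[t] = (counts + sep + f"{v}={n}",
                     ratios + sep + f"{v}={n / tot[t]:.2f}")
    return result


for t, (counts, ratios) in summarize(ROWS).items():
    print(t, counts, ratios)
```

As with the queries, a value that never occurs for a type is simply absent from that type's string (type A has no y entry).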
Answer 1: (score: 0)
Updated the code based on edit #3 of the question:
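A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;
-- Now get all the distinct types and all the distinct values.
T0 = foreach A generate type;
T = distinct T0;
M0 = foreach A generate value;
M = distinct M0;
-- Pair every type with every value, so that every type has exactly the same number of values associated.
X0 = cross T, M;
X = foreach X0 generate T::type as type, M::value as value;
-- Left outer join the pairs with C; pairs that never occur get a null count.
N = join X by (type, value) left outer, C by (type, value);
O = foreach N generate
X::type as type,
X::value as value,
(C::count is null ? 0L : C::count) as count; -- count = 0 means the value was not associated with the type
P = group O by type;
Q = foreach P {
R = order O by value asc; -- Ordered by value, so the value counts are ordered consistently in all the rows.
-- BagToTuple turns the bag into columns; flattening the bag directly would emit one row per count.
generate group as type, flatten(BagToTuple(R.count));
}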
Please note that I have not executed the code above. These are just representative steps.
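Since the Pig script is untested, here is the same sequence of steps (count per pair, pair every type with every value, fill missing pairs with 0, order by value) sketched in Python on the sample data from the question. The variable names are mine and do not correspond one-to-one to the Pig aliases.

```python
from collections import Counter
from itertools import product

# Sample rows copied from the table in the question.
ROWS = [
    ("A", "x"), ("B", "y"), ("C", "y"), ("B", "y"), ("C", "z"),
    ("A", "x"), ("A", "z"), ("A", "z"), ("A", "x"), ("B", "x"),
    ("B", "z"), ("B", "x"), ("C", "x"),
]

# Group by (type, value) and count.
pair_counts = Counter(ROWS)

# Distinct types and distinct values, sorted so the column order is stable.
types = sorted({t for t, _ in ROWS})
values = sorted({v for _, v in ROWS})

# Every (type, value) pair, with 0 where the pair never occurs
# (the cross product followed by the left outer join).
filled = [(t, v, pair_counts.get((t, v), 0)) for t, v in product(types, values)]

# Group by type; counts are already ordered by value within each type.
pivot = {}
for t, _, n in filled:
    pivot.setdefault(t, []).append(n)

print("Type", *values)
for t, row in pivot.items():
    print(t, *row)
```

The zero for (A, y) only appears because the full set of pairs is built first; joining the distinct values alone against the counts would never produce it.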
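Answer 2: (score: 0)
You could use the vector-operation UDFs in Brickhouse (http://github.com/klout/brickhouse) and treat each 'value' as a dimension in a very high-dimensional space. A single instance of a value can then be interpreted as a vector along that dimension with a magnitude of 1. In Hive, we represent such a vector simply as a map with a string as the key and an int or other numeric as the value.
What you want to create is a vector that is the sum of all of these vectors, grouped by type. The query would be: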
SELECT type,
union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;
Brickhouse even has a normalize function, which would produce the 'averages':
SELECT type,
vector_normalize(union_vector_sum( map( value, 1 ) ))
as normalized_vector
FROM table
GROUP BY type;
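The idea of summing one-entry maps per type can be illustrated outside Hive. The Python sketch below is not Brickhouse code: `vector_sum_by_type` and `scale_to_unit_sum` are made-up names, and dividing by the sum of the entries is my assumption about what kind of normalization yields the averages asked for; the Brickhouse function may use a different norm.

```python
from collections import Counter, defaultdict

# Sample rows copied from the table in the question.
ROWS = [
    ("A", "x"), ("B", "y"), ("C", "y"), ("B", "y"), ("C", "z"),
    ("A", "x"), ("A", "z"), ("A", "z"), ("A", "x"), ("B", "x"),
    ("B", "z"), ("B", "x"), ("C", "x"),
]


def vector_sum_by_type(rows):
    """Sum the one-entry vectors {value: 1}, grouped by type."""
    sums = defaultdict(Counter)
    for t, v in rows:
        sums[t].update({v: 1})   # add the vector for this single row
    return {t: dict(c) for t, c in sums.items()}


def scale_to_unit_sum(vector):
    """Assumed normalization: divide each entry by the sum of all entries."""
    total = sum(vector.values())
    return {k: n / total for k, n in vector.items()}


vectors = vector_sum_by_type(ROWS)
for t in sorted(vectors):
    print(t, vectors[t], scale_to_unit_sum(vectors[t]))
```

Note that a map only has keys for values that actually occur, so type A gets no y entry rather than y = 0.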
Answer 3: (score: 0)
A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;