Pig query to pivot rows and columns using counts of like rows

Posted: 2014-08-12 14:51:27

Tags: sql hadoop apache-pig

I'm trying to come up with a SQL or Pig query that, for each type, produces counts of the distinct values associated with it.

In other words, given this table:

Type:    Value:
A        x
B        y
C        y
B        y
C        z
A        x
A        z
A        z
A        x
B        x
B        z
B        x
C        x

I would like to get the following result:

Type:    x:    y:    z:
A         3     0     2
B         2     2     1
C         1     1     1

In addition, a table of averages (each count as a fraction of its row total) would also be useful:

Type:    x:    y:    z:
A         0.60  0.00  0.40
B         0.40  0.40  0.20 
C         0.33  0.33  0.33
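If the set of values is fixed and known up front (here x, y and z), one way to express the pivot directly in Pig is with bincond expressions, hard-coding one column per value. The following is an untested sketch; the input path and the (type, value) schema are placeholders:

A = LOAD 'tablex' AS (type:chararray, value:chararray);   -- placeholder path, tab-delimited (type, value) pairs
B = FOREACH A GENERATE type,
        (value == 'x' ? 1 : 0) AS is_x,
        (value == 'y' ? 1 : 0) AS is_y,
        (value == 'z' ? 1 : 0) AS is_z;
C = GROUP B BY type;
D = FOREACH C GENERATE group AS type,
        SUM(B.is_x) AS x, SUM(B.is_y) AS y, SUM(B.is_z) AS z;   -- counts per type
E = FOREACH D GENERATE type,                                    -- fractions of the row total
        (double)x / (x + y + z), (double)y / (x + y + z), (double)z / (x + y + z);
DUMP D;
DUMP E;

The obvious limitation is that every value has to be spelled out by hand, which is why the attempts below build the column set dynamically.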

Edit 4

I am a Pig novice, but after reading 8 different Stack Overflow posts I came up with the following.

When I run this Pig query:

A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;                 -- all row keys
xx = distinct x;                                  -- distinct row keys
y = foreach A GENERATE id_resp_h;                 -- all column keys
yy = distinct y;                                  -- distinct column keys
yyy = group yy all;                               -- single bag of all column keys, used for the header row
zz = GROUP A BY (id_orig_h, id_resp_h);           -- one group per (row key, column key) pair that occurs
B = CROSS xx, yy;                                 -- every possible (row key, column key) combination
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten(group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);   -- keep pairs with no occurrences
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));             -- header row listing the column keys
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count; -- one pivoted row per row key
dump G;
dump I;

Some of it works...

I get:

(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)

I can dump this to a text file, strip out the "(" and ")", and use it as a CSV whose first row is the schema. That kind of works, but it is SO SLOW. I would like a better, faster, cleaner way of doing this. If anyone knows of one, please let me know.
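As a side note on the formatting step only: STOREing the relations with PigStorage writes one delimited record per line without the surrounding parentheses, so the strip-and-convert pass may not be needed. A small sketch, with placeholder output paths:

-- assumes the G (header) and I (pivoted counts) relations from the script above
STORE G INTO '/tmp/pivot_header' USING PigStorage(',');
STORE I INTO '/tmp/pivot_counts' USING PigStorage(',');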

4 Answers:

Answer 0 (score: 0)

The best I can come up with only works on Oracle, and while it won't give you one column per value, it will display the data like this:

A   x=3,y=3,z=3
B   x=4,y=3
C   y=3,z=2

Of course, if you have 900 values it will show something like:

A  x=3,y=6,...,ff=12 

and so on...

I can't add comments, so I couldn't ask whether Oracle works for you. In any case, here is a query that will do it:

SELECT type, vals FROM
  (SELECT type,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','), 2) vals,
          seq,
          MAX(seq) OVER (PARTITION BY type) max_seq
   FROM
     (SELECT type, value, OCC,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY type, value) seq
      FROM
        (SELECT type, value, COUNT(*) OCC
         FROM tableName
         GROUP BY type, value))
   START WITH seq = 1
   CONNECT BY PRIOR seq + 1 = seq
          AND PRIOR type = type)
WHERE seq = max_seq;

For the averages you need the per-type total to be added before everything else; here is the code:

SELECT * FROM
  (SELECT type,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','), 2) vals,
          SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','), 2) average,
          seq,
          MAX(seq) OVER (PARTITION BY type) max_seq
   FROM
     (SELECT type, value, TOT, OCC,
             ROW_NUMBER() OVER (PARTITION BY type ORDER BY type, value) seq
      FROM
        (SELECT type, value, TOT, COUNT(*) OCC
         FROM (SELECT type, value, COUNT(*) OVER (PARTITION BY type) TOT
               FROM tableName)
         GROUP BY type, value, TOT))
   START WITH seq = 1
   CONNECT BY PRIOR seq + 1 = seq
          AND PRIOR type = type)
WHERE seq = max_seq;

Answer 1 (score: 0)

Updated the code based on edit #3 of the question:

A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;

-- Now get all the values.
M = foreach A generate value;

-- Left Outer Join all the values with C, so that every type has exactly same number of values associated
N = join M by value left outer, C by value;
O = foreach N generate 
                  C::type as type, 
                  M::value as value, 
                  (C::count is null ? 0 : C::count) as count; -- count = 0 means the value was never associated with the type
P = group O by type;
Q = foreach P {
                  R = order O by value asc;  --Ordered by value, so values counts are ordered consistently in all the rows.
                  generate group as type, flatten(R.count);
              }

Please note that I have not executed the code above; these are just representative steps.

Answer 2 (score: 0)

You could use the vector operation UDFs in Brickhouse (http://github.com/klout/brickhouse) and think of each 'value' as a dimension in a very high-dimensional space. A single instance of a value can then be interpreted as a vector along that dimension with magnitude 1. In Hive, we represent such vectors simply as maps with strings as the keys and an int or other numeric type as the values.

What you want to create is a vector that is the sum of all those vectors, grouped by type. The query would be:

SELECT type, 
  union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;

Brickhouse even has a normalize function that will produce your 'averages':

SELECT type, 
  vector_normalize(union_vector_sum( map( value, 1 ) ))
     as normalized_vector
FROM table
GROUP BY type;

Answer 3 (score: 0)

A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;