Advanced pivot table in Pig

Date: 2014-03-14 22:27:04

Tags: apache-pig

I have already looked at the two other Pig pivot questions on StackOverflow without success. This one is a bit different.

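I want to write a generic pivot function for which I do not know the schema ahead of time. To make matters worse, I need to pivot on an arbitrary number of columns and generate new columns, similar to how an Excel pivot works. For example: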

user year  make    model     mileage 
=======================================
123  2011  Ford    Taurus    19.2 
123  2011  Subaru  Forester  23.9
123  2012  Nissan  Altima    25.6
123  2013  Ford    Taurus    21.8

Suppose I want to pivot on user ID and year in this case:

user year  Ford_Taurus_mileage  Subaru_Forester_mileage  Nissan_Altima_mileage
=================================================================================
123  2011  19.2                 23.9
123  2012                                                25.6
123  2013  21.8

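The Excel configuration for the above would be two row labels (user and year), a single value column (mileage), and two column labels (make and model).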
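To pin down the expected result, here is a minimal sketch of the same pivot in plain Python rather than Pig. The function name pivot and the in-memory row format are assumptions made for illustration; nothing here uses a Pig API.

```python
from collections import OrderedDict

# Sample rows from the question: (user, year, make, model, mileage)
rows = [
    ("123", "2011", "Ford", "Taurus", 19.2),
    ("123", "2011", "Subaru", "Forester", 23.9),
    ("123", "2012", "Nissan", "Altima", 25.6),
    ("123", "2013", "Ford", "Taurus", 21.8),
]

def pivot(rows):
    """Pivot on (user, year); one new column per distinct (make, model)."""
    columns = []            # new column names, in first-seen order
    table = OrderedDict()   # (user, year) -> {column name: mileage}
    for user, year, make, model, mileage in rows:
        name = "%s_%s_mileage" % (make, model)
        if name not in columns:
            columns.append(name)
        table.setdefault((user, year), {})[name] = mileage
    header = ["user", "year"] + columns
    # Missing cells become None, mirroring the blanks in the desired output
    body = [[user, year] + [cells.get(c) for c in columns]
            for (user, year), cells in table.items()]
    return header, body

header, body = pivot(rows)
print(header)
for line in body:
    print(line)
```

If two input rows map to the same cell, this sketch keeps the last value seen; a real pivot would need an explicit aggregation rule there.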

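I am beginning to think this may not be possible in Pig, but I wanted to post here just in case. I have considered having the user supply all the columns up front (to a UDF) so that the schema can be built, but even then, how would I merge the rows together (for example, for 2011 we go from two rows down to one)?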

Any help would be greatly appreciated. Thanks.

2 answers:

Answer 0: (score: 0)

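Although the result is aesthetically questionable, this is indeed possible. Pig does not know the names of all the distinct values that your make and model can take, so you have to run an embedded Pig script and extract the levels of those variables.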

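This script handles any number n of makes/models and generates the type of output you requested. To run it, type pig -x local pivot.py in the same directory (or whatever you decided to name the file, if not pivot.py).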

import collections
from org.apache.pig.scripting import *
input_path = 'tmp.txt' #Set to whatever your input path and filename are
#First, we run an embedded job to find all the distinct levels of model and make
find_distincts = """
A = LOAD '$INPUT' USING PigStorage() AS (user:chararray
        , year:chararray
        , make:chararray
        , model:chararray
        , mileage:chararray);
B = FOREACH A GENERATE make, model;
C = DISTINCT B;
DUMP C;
"""
P = Pig.compile(find_distincts)
output = P.bind({'INPUT':input_path}).runSingle()
#Gather the models and makes from the output of the Pig script
cars = []
CarRecord = collections.namedtuple('CarRecord', 'make model')
for x in output.result("C").iterator():
        cars.append(CarRecord(make=x.get(0),model=x.get(1)))
#Next, we create a series of conditionals based off these distinct values
pivot_str = ""
cut_str = ""
#List of filters
for car in cars:
        cut_str += "%s_%s_cut" % car + "= FOREACH A GENERATE (make == '%s' AND model == '%s'" % car + "?mileage:0) AS mileage;"
#Output schema for rows we grouped by
pivot_str += "GENERATE FLATTEN(group.user) AS user, FLATTEN(group.year) AS year"
#Output schema for columns we grouped by
for car in cars:
        pivot_str += ', FLATTEN(%s_%s_cut.mileage)' % car + ' AS %s_%s_mileage' % car
pivot_str += ';'
#If you stopped the script here, it almost works--
#this approach yields duplicate records, so we have to enact a DISTINCT.
#It also produces every element of a (user,year) set, not just the
#intersection. To solve this, I sum the rows and keep only the greatest row.
sum_str = 'FOREACH C GENERATE user.., (%s_%s_mileage' % cars[0]
for car in cars[1:]:
        sum_str += ' + %s_%s_mileage' % car
sum_str += ') AS user_year_sum;'
car_str = "%s_%s_mileage" % cars[0]
for car in cars[1:]:
        car_str += ", %s_%s_mileage" % car
car_str += ';'
create_pivot = """
A = LOAD '$INPUT' USING PigStorage() AS (user:chararray
        , year:chararray
        , make:chararray
        , model:chararray
        , mileage:float);
B = FOREACH (GROUP A BY (user, year)){
        %s
        %s
};
C = DISTINCT B;
D = %s
E = GROUP D BY (user, year);
F = FOREACH E GENERATE group.user, group.year, MAX(D.user_year_sum) AS greatest;
G = JOIN F BY (user, year, greatest), D BY (user, year, user_year_sum);
out = FOREACH G GENERATE F::user AS user, F::year AS year, %s
rmf pivoted_results;
STORE out INTO 'pivoted_results';
DESCRIBE out;
""" % (cut_str,pivot_str,sum_str,car_str)
print create_pivot
create_pivot_P = Pig.compile(create_pivot)
output = create_pivot_P.bind({'INPUT':input_path}).runSingle()

Output, using your sample input:

123     2011    19.2    0.0     23.9
123     2012    0.0     25.6    0.0
123     2013    21.8    0.0     0.0

I assume you are fine with nulls being set to zero, since in theory no car has a mileage of zero.
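To make the string-building step of the script easier to follow, the sketch below runs it in isolation, with the distinct (make, model) pairs hard-coded instead of read back from Pig. The helper names cut_statement and pivot_column are introduced here for illustration only; the format strings are the ones the script uses.

```python
import collections

CarRecord = collections.namedtuple('CarRecord', 'make model')

# Distinct (make, model) pairs, hard-coded here instead of read from Pig
cars = [CarRecord('Ford', 'Taurus'),
        CarRecord('Nissan', 'Altima'),
        CarRecord('Subaru', 'Forester')]

def cut_statement(car):
    # Same format strings as the script: the namedtuple fills both %s slots
    return ("%s_%s_cut" % car
            + "= FOREACH A GENERATE (make == '%s' AND model == '%s'" % car
            + "?mileage:0) AS mileage;")

def pivot_column(car):
    # One output column per (make, model) pair
    return 'FLATTEN(%s_%s_cut.mileage)' % car + ' AS %s_%s_mileage' % car

for car in cars:
    print(cut_statement(car))
    print(pivot_column(car))
```

Printing the generated statements like this before handing them to Pig.compile is a cheap way to check the Pig Latin the script is about to run.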

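Addendum: besides the Pig documentation I linked above, another good resource is Alan Gates's Programming Pig, which, as of this writing, is available online for free in its entirety.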

Answer 1: (score: 0)

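I think I have solved this by writing a custom store function. The user supplies an Excel-like list of parameters: "row labels", "column labels", and "values". The store function then uses checkSchema to obtain the original schema information and stores it in the UDF context. Next, when putNext is called, new columns are created from the column labels and the original value column name. The row labels are simply written out. A HashSet of new column names is maintained. Each time a value is written out, if the number of new column names has grown, we rewrite the new schema to disk (after deleting the old one).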

For the values, a few iterations of putNext look like this:

123,2011,19.2
123,2011,,23.9
123,2012,,,25.6
123,2013,21.8,,

The corresponding schemas are:

user,year,Ford_Taurus_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage
user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage

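The schema does not change for the last row because the input is another mileage record for the Taurus, for which we already have a column.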
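The way the written lines and the schema grow together can be sketched in plain Python. This is only a simulation of the behaviour described above, not the Java StoreFunc itself: simulate_store is a hypothetical name, an ordered list stands in for the HashSet, and the schema is returned instead of being rewritten to disk.

```python
def simulate_store(rows, row_labels, column_labels, value):
    """Yield (written line, schema so far) for each input record."""
    new_columns = []  # ordered stand-in for the HashSet of new column names
    for record in rows:
        # New column name: column label values plus the value column name
        name = "_".join([record[c] for c in column_labels] + [value])
        if name not in new_columns:
            new_columns.append(name)  # schema grew; the real code rewrites it
        # One cell per column known so far, empty except for this record's value
        cells = [""] * len(new_columns)
        cells[new_columns.index(name)] = record[value]
        fields = [record[r] for r in row_labels] + cells
        yield ",".join(fields), ",".join(row_labels + new_columns)

fields = ["user", "year", "make", "model", "mileage"]
rows = [dict(zip(fields, r)) for r in [
    ("123", "2011", "Ford", "Taurus", "19.2"),
    ("123", "2011", "Subaru", "Forester", "23.9"),
    ("123", "2012", "Nissan", "Altima", "25.6"),
    ("123", "2013", "Ford", "Taurus", "21.8"),
]]

steps = list(simulate_store(rows, ["user", "year"], ["make", "model"], "mileage"))
for line, schema in steps:
    print(line, "|", schema)
```

Each line is only as wide as the schema was when it was written, which is why the earlier lines come out shorter than the final schema.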

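After all the data has been written, we can read it back in using the new schema. Unfortunately, this means that the first records written out will be missing fields (see the first few rows in the iterations above), so I also had to override the LoadFunc method getNext to call a new applySchema method, modified from the one in PigStorage. This new applySchema method appends nulls to the tuple if the number of fields in the tuple does not match the number of fields in the schema. For instance, in the example above, the first row written by the StoreFunc is: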

(123,2011,19.2)

But the overall schema looks like this:

(user,year,Ford_Taurus_mileage,Subaru_Forester_mileage,Nissan_Altima_mileage)

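This means the first row is missing two fields. With the new LoadFunc, we append the necessary number of empty fields so that the tuple looks like this: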

(123,2011,19.2,,)
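The padding step amounts to the following, shown as a plain Python sketch rather than the actual Java override. The name apply_schema is borrowed from the description above, and raising an error for a tuple longer than the schema is an assumption of this sketch.

```python
def apply_schema(fields, schema):
    """Pad a short tuple with None so it has one field per schema column."""
    missing = len(schema) - len(fields)
    if missing < 0:
        # Assumption: a tuple wider than the schema is treated as an error
        raise ValueError("tuple has more fields than the schema")
    return tuple(fields) + (None,) * missing

schema = ("user", "year", "Ford_Taurus_mileage",
          "Subaru_Forester_mileage", "Nissan_Altima_mileage")
padded = apply_schema(("123", "2011", "19.2"), schema)
print(padded)
```

A tuple that already matches the schema passes through unchanged, so the same call can be applied to every record read back.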

All that remains is to group the reloaded data by user and year and take the average of the remaining 3 columns to flatten things.
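That final flattening step can be sketched in plain Python as follows. The function name flatten and the in-memory tuples are assumptions for illustration. Nulls are skipped when averaging, which is also how Pig's built-in AVG treats them.

```python
from collections import OrderedDict

def flatten(rows, n_keys=2):
    """Group rows on their first n_keys fields and average each remaining
    column, skipping None values."""
    groups = OrderedDict()
    for row in rows:
        groups.setdefault(tuple(row[:n_keys]), []).append(row[n_keys:])
    result = []
    for key, members in groups.items():
        averages = []
        for column in zip(*members):  # walk the value columns one at a time
            present = [v for v in column if v is not None]
            averages.append(sum(present) / len(present) if present else None)
        result.append(key + tuple(averages))
    return result

# Reloaded records after padding, with mileage already parsed as float
reloaded = [
    ("123", "2011", 19.2, None, None),
    ("123", "2011", None, 23.9, None),
    ("123", "2012", None, None, 25.6),
    ("123", "2013", 21.8, None, None),
]

flattened = flatten(reloaded)
for row in flattened:
    print(row)
```

Because each (user, year, column) cell holds at most one non-null value in this example, the average simply returns that value; with duplicates it would blend them.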