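I am trying to compute the difference between consecutive records within each group, and also to add a row number within each group. In Hive this can be done with window functions, using the lag and row_number functions. I am trying to recreate it with Pig and a Python UDF.
In the example below, the row number needs to restart at 1 for each name and increase with each new month (each new record). I also need, for each name, the difference in balance from the previous month.
Input data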
name month balance
A 1 10
A 2 5
A 3 15
B 2 20
B 3 10
B 4 45
B 5 50
Output data
name month balance row_number balance_diff
A 1 10 1 0
A 2 5 2 -5
A 3 15 3 10
B 2 20 1 0
B 3 10 2 -10
B 4 45 3 35
B 5 50 4 5
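For reference, the calculation described above can be sketched in plain Python, outside Pig. This is only an illustration of the intended logic under the stated rules (row numbers restart at 1 for each name, and the first record of each name gets a difference of 0); the function name `add_row_metrics` is hypothetical and not part of any library.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the input table: (name, month, balance).
rows = [
    ("A", 1, 10), ("A", 2, 5), ("A", 3, 15),
    ("B", 2, 20), ("B", 3, 10), ("B", 4, 45), ("B", 5, 50),
]

def add_row_metrics(rows):
    """Return (name, month, balance, row_number, balance_diff) tuples."""
    out = []
    # Sort by name, then month, so groupby sees each name as one run.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for name, group in groupby(ordered, key=itemgetter(0)):
        prev = None  # previous month's balance within this name
        for row_number, (_, month, balance) in enumerate(group, start=1):
            balance_diff = 0 if prev is None else balance - prev
            out.append((name, month, balance, row_number, balance_diff))
            prev = balance
    return out

for row in add_row_metrics(rows):
    print(" ".join(str(v) for v in row))
```

A reference like this is handy for checking whatever the Pig script eventually produces.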
How can I do this with Pig and a Python UDF? My attempt is below.
Pig
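-- input and output are reserved words in Pig Latin, so the aliases are inpdat and outdat
outdat = foreach (group inpdat by name) {
    sorted = order inpdat by month asc;
    -- call the UDF defined below (row_metrics)
    row_details = myudf.row_metrics(sorted.(month, balance));
    generate flatten(sorted), flatten(row_details);
};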
Python UDF
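def row_num(mth):
    # 1-based position of each record within the group
    return [i + 1 for i, _ in enumerate(mth)]

def diff(bal, n=1):
    # Pad the lagged series with n leading None values so the first n rows get 0
    lagged = [None] * n + list(bal)
    return [x - y if (x is not None and y is not None) else 0
            for x, y in zip(bal, lagged)]

@outputSchema('udfbag:bag{udftuple:tuple(row_number: int, balance_diff: int)}')
def row_metrics(mthbal):
    # mthbal is a bag of (month, balance) tuples, already sorted by month
    mth, bal = zip(*mthbal)
    row_number = row_num(mth)
    balance_diff = diff(bal)
    return zip(row_number, balance_diff)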
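My Python function works. However, once I bring the result back into Pig, I have trouble combining the two bags (sorted and row_details). Any help is much appreciated.
I have also seen that the Enumerate function in Pig does what I want for the row number. However, as part of learning Pig, I am looking for a solution that uses a Python UDF.
Answer 0: (score: 0)
Try this.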
Python UDF:
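def row_num(mth):
    # 1-based position of each record within the group
    return [i + 1 for i, _ in enumerate(mth)]

def diff(bal, n=1):
    # The first n rows have no earlier record to compare with, so they get 0
    return [0] * n + [x - y for x, y in zip(bal[n:], bal[:-n])]

@outputSchema('udfbag:bag{udftuple:tuple(name: chararray, mth: int, row_number: int, balance_diff: int)}')
def row_metrics(mthbal):
    # mthbal is the whole sorted bag of (name, month, balance) tuples.
    # balance is used for the diff but not returned; add it to the schema
    # and to the zip below if it is needed downstream.
    name, mth, bal = zip(*mthbal)
    row_number = row_num(mth)
    balance_diff = diff(bal)
    return zip(name, mth, row_number, balance_diff)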
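The UDF logic can be checked locally before it is registered in Pig. The sketch below is a minimal harness under a few assumptions: it runs in plain Python 3 rather than Jython, the `outputSchema` function is a hypothetical stand-in for the decorator that Pig supplies to Jython UDFs, and the sample bag is group B from the question. The three functions follow the same logic as the UDF in this answer.

```python
def outputSchema(schema):
    """Hypothetical local stand-in for Pig's decorator; it only records the schema."""
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

def row_num(mth):
    # 1-based position of each record within the group
    return [i + 1 for i, _ in enumerate(mth)]

def diff(bal, n=1):
    # First row has no predecessor, so its difference is 0
    return [0] + [x - y for x, y in zip(bal[n:], bal[:-n])]

@outputSchema('udfbag:bag{udftuple:tuple(name: chararray, mth: int, row_number: int, balance_diff: int)}')
def row_metrics(mthbal):
    name, mth, bal = zip(*mthbal)
    # list() so the result can be printed and indexed under Python 3
    return list(zip(name, mth, row_num(mth), diff(bal)))

# Group B from the question, already sorted by month.
group_b = [("B", 2, 20), ("B", 3, 10), ("B", 4, 45), ("B", 5, 50)]
result = row_metrics(group_b)
print(result)
```

Running the functions on one small group at a time mirrors what Pig does inside the nested foreach, where the UDF receives one sorted bag per name.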
Pig Script:
register 'myudf.py' using jython as myudf;
inpdat = load 'input.dat' using PigStorage(',') as (name:chararray, month:int, balance:int);
outdat = foreach (group inpdat by name) {
    sorted = order inpdat BY month asc;
    row_details = myudf.row_metrics(sorted);
    generate flatten (row_details);
};
dump outdat;
Answer 1: (score: 0)
In my case I used piggybank's Stitch function. I would be interested to hear about any other approaches.
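REGISTER /mypath/piggybank.jar;
-- register the Python UDF file as well (file name as in the answer above)
register 'myudf.py' using jython as myudf;
define Stitch org.apache.pig.piggybank.evaluation.Stitch;
-- input and output are reserved words in Pig Latin, so the aliases are inpdat and outdat
inpdat = load 'input.dat' using PigStorage(',') as (name:chararray, month:int, balance:int);
outdat = FOREACH (group inpdat by name) {
    sorted = ORDER inpdat by month asc;
    -- row_metrics here is the two-field version from the question
    udf_fields = myudf.row_metrics(sorted.(month, balance));
    generate flatten(Stitch(sorted, udf_fields)) as (name, month, balance, row_number, balance_diff);
};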
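To make the idea behind stitching concrete, here is a rough Python sketch of joining two bags position by position. It is not piggybank's implementation: the `stitch` function is a hypothetical illustration that only covers bags of equal length (Python's `zip` stops at the shortest input, which may differ from how Stitch treats bags of unequal length).

```python
def stitch(*bags):
    """Join bags positionally by concatenating the tuples at each index."""
    stitched = []
    for rows in zip(*bags):
        combined = ()
        for row in rows:
            combined += tuple(row)
        stitched.append(combined)
    return stitched

# Group A from the question, sorted by month: (name, month, balance).
sorted_bag = [("A", 1, 10), ("A", 2, 5), ("A", 3, 15)]
# What the two-field UDF would return for that group: (row_number, balance_diff).
udf_fields = [(1, 0), (2, -5), (3, 10)]

stitched = stitch(sorted_bag, udf_fields)
for row in stitched:
    print(row)
```

Because the join is purely positional, both bags have to be in the same order, which is why the UDF is fed the already sorted bag.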