Hadoop Pig Ordered Analytical Functions

Time: 2015-07-30 17:21:04

Tags: hadoop apache-pig

I'm new to Pig and would like to use ordered analytic functions, similar to what is possible in SQL.

My data looks like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is see how the opening price changes from one day to the next. So my output would look like this:

(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)

I need Pig to be able to look at the rows before or after the current row. Is this possible, or does Pig not allow this type of analysis?

1 Answer:

Answer 0: (score: 0)

You can use the script below to get the expected output, though it may need some fine-tuning.

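-- Load the comma-separated rows; fields are referenced by position ($0, $1, ...).
A = LOAD '/tmp/pig/test/test' USING PigStorage(',');
-- 'MM' is the month in the date pattern (lowercase 'mm' would be parsed as minutes).
B = FOREACH A GENERATE $0 AS stock_symbol, ToDate($1,'yyyy-MM-dd') AS dt, (double)$2 AS stock_price_open, (double)$3 AS stock_price_close, 'PT24H' AS dthour;
-- dt_old is the previous calendar day (dt minus the 24-hour duration).
C = FOREACH B GENERATE $0 AS stock_symbol, $1 AS dt_curr, SubtractDuration($1,$4) AS dt_old, $2 AS stock_price_open, $3 AS stock_price_close;
-- A filter that passes every row with a non-null date gives a second alias of C for the self-join.
START = FILTER C BY ($1 == $1);
D = JOIN C BY $0, START BY $0;
-- Keep only pairs where the START row is exactly one day before the C row.
Filter_D = FILTER D BY ((DaysBetween($1,$6)==1) AND (DaysBetween($2,$7)==1));
-- $3 is the current day's open, $8 the previous day's open.
E = FOREACH Filter_D GENERATE $0 AS stock_symbol, $1 AS dt, $3 AS stock_price_open, $4 AS stock_price_close, $3-$8 AS stock_price_change;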

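Output (from a run that used the lowercase 'mm' pattern, which is why the timestamps show month 01 and minute 08; with 'MM' the dates parse as August 2001):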

(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)

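Because each row has to be compared with the opening price from one day earlier, the constant 'PT24H' defines a 24-hour duration. The next step uses ToDate() and SubtractDuration() with it to compute the previous day's date, and the Join and DaysBetween() steps then pair each row with the previous day's row so the difference can be taken.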
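To make the join logic easier to follow, here is a minimal Python sketch of the same idea outside Pig: each row is matched with the row for the same symbol dated exactly one calendar day earlier. The function name `change_by_date_shift` is made up for illustration, and the rows are the sample data from the question.

```python
from datetime import date, timedelta

# Sample rows from the question: (stock_symbol, date, open, close).
rows = [
    ("TAC", date(2001, 8, 6), 16.39, 16.36),
    ("TAC", date(2001, 8, 7), 16.3, 16.54),
    ("TAC", date(2001, 8, 8), 16.55, 16.44),
    ("TAC", date(2001, 8, 9), 16.45, 16.48),
    ("TAC", date(2001, 8, 10), 16.5, 15.8),
]


def change_by_date_shift(rows):
    """Pair each row with the row dated exactly one day earlier (hypothetical helper)."""
    # Index opening prices by (symbol, date), like the second side of the self-join.
    open_by_key = {(sym, day): open_ for sym, day, open_, _ in rows}
    out = []
    for sym, day, open_, close in rows:
        prev_open = open_by_key.get((sym, day - timedelta(days=1)))
        if prev_open is not None:
            # Inner-join behaviour: rows with no match the day before are dropped.
            out.append((sym, day, open_, close, open_ - prev_open))
    return out


result = change_by_date_shift(rows)
for row in result:
    print(row)
```

Like the inner join in the script, this drops the first row instead of emitting it with a NULL change, and it would also drop any row whose previous trading day is not the previous calendar day (for example a Monday following a weekend).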

ToDate(), SubtractDuration(), and DaysBetween() are built-in Pig functions; you can write a suitable UDF to fine-tune this script and handle the task more appropriately.
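As a rough idea of what such a UDF could compute, the sketch below is plain Python that sorts the rows per symbol and subtracts the previous row's opening price, leaving the first row's change empty as in the question's desired output. The function name `with_open_change` and the rounding to two decimals are assumptions made for illustration; how the function would be registered with Pig (for example as a Jython UDF applied to a grouped, ordered bag) is not shown here.

```python
from itertools import groupby
from operator import itemgetter


def with_open_change(rows):
    """Append the open-price change versus the previous row of the same symbol.

    Hypothetical helper: rows are (stock_symbol, date, open, close) tuples with
    ISO-formatted date strings, so sorting the strings sorts chronologically.
    """
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by symbol, then date
    for _, group in groupby(ordered, key=itemgetter(0)):
        prev_open = None
        for sym, day, open_, close in group:
            # First row of each symbol has no predecessor, so the change is None.
            change = None if prev_open is None else round(open_ - prev_open, 2)
            out.append((sym, day, open_, close, change))
            prev_open = open_
    return out


# Sample rows from the question.
rows = [
    ("TAC", "2001-08-06", 16.39, 16.36),
    ("TAC", "2001-08-07", 16.3, 16.54),
    ("TAC", "2001-08-08", 16.55, 16.44),
    ("TAC", "2001-08-09", 16.45, 16.48),
    ("TAC", "2001-08-10", 16.5, 15.8),
]

result = with_open_change(rows)
for row in result:
    print(row)
```

Because this works on row order and not on calendar arithmetic, a gap in the dates (weekends, holidays) does not drop rows; the change is simply taken against the most recent earlier row for that symbol.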