How to divide numbers from different tables in Pig

Time: 2016-03-29 03:21:04

Tags: hadoop apache-pig

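I am trying to join two tables and divide a number from one table by a number from the other. I have tried doing this both directly on the joined relation and by first generating a new table with the same values, but I get the same error both times, which confuses me even more.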

--get the data 
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);

--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);

--group by ticker symbol
companies = GROUP specDates BY ticker;

--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY date DESC;
    endDate = LIMIT sorted1 1;
    GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}

--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}

joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;

The error I keep getting is:

(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float)  right hand side:bag :tuple(open:float) 

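Both are floats, so I am not sure why they are "incompatible types", other than that they come from different bags. Adding them to "final" and trying to divide from there did not work either.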

The data format is:

AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000

Every entry contains all columns, and the values are correctly formatted, non-zero numbers.

1 Answer:

Answer 0: (score: 0)

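Based on your query, I created a dummy table on my system and generated the result. I found no problem: the division completed successfully. Please find below some sample queries I ran on Pig: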

A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
C = join A by ID, B by ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;

Could you confirm that the divisor values in your case contain no nulls or zeros?

Could you also share the load statements for sorted_end and sorted_begin?
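
One more thing worth checking, going by the error message itself: both operands are reported as bag, not float. Inside a nested FOREACH, LIMIT still yields a bag, so a projection such as endDate.close is a one-tuple bag rather than a scalar, and Pig cannot divide one bag by another. The sketch below is one possible way around this, not a tested fix for your data: it uses FLATTEN to unpack the single-row bags before the join. The ',' delimiter and the float column types are assumptions taken from your sample rows and from the error message; all other names are reused from your script.

```pig
-- Sketch only. Assumed: ',' delimiter and float price columns (from the sample
-- rows and the error message); aliases and field names follow the question.
lines = LOAD '/historicaldata.csv' USING PigStorage(',')
        AS (ticker:chararray, date:long, open:float, high:float,
            low:float, close:float, volume:long);

specDates = FILTER lines BY (date <= 20000103 AND date >= 19900101);
companies = GROUP specDates BY ticker;

-- Latest row per ticker. "group" is the scalar grouping key, and FLATTEN
-- unpacks the one-tuple bag so that close becomes a plain float.
sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY date DESC;
    endDate = LIMIT sorted1 1;
    GENERATE group AS ticker, FLATTEN(endDate.close) AS close;
}

-- Earliest row per ticker, unpacked the same way.
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE group AS ticker, FLATTEN(startDate.open) AS open;
}

joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;

-- Both operands are scalar floats here, so the division needs no cast.
final2 = FOREACH joined GENERATE sorted_end::ticker AS ticker,
         (sorted_end::close / sorted_begin::open) AS growth_factor;
```

Running DESCRIBE sorted_end; before and after the change should show whether the fields are still wrapped in a bag.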