Pig JOIN includes a bag that must be filtered by a field outside the bag

Date: 2016-04-27 07:50:33

Tags: hadoop tuples apache-pig filtering bag

I'm getting up to speed with Pig and am combining data from two sources: web log data and stock pricing history. The dates/times are normalized to timestamps, and a join is performed on the stock symbol. The timestamps themselves do not match exactly.

jnd = JOIN web_time BY w_sym, stock_sort BY group;

The group contains a bag of stock data specific to that symbol. Here is the combined schema:

jnd: {web_time::ip: chararray, web_time::user: chararray, web_time::w_time: long, web_time::url: chararray, stock_sort::sort: {(sym: chararray, time: long, price: double)}}

I need to filter the stock_sort bag using web_time::w_time against time, which is not an exact match. Sample jnd data looks like this:

(14.192.253.226,voraciouszing,1213201721000,"GET /VLCCF.html HTTP/1.0",{(VLCCF,1265361975000,13.84),(VLCCF,1265262560000,14.16),(VLCCF,1265192740000,14.44),(VLCCF,1265099390000,14.48),(VLCCF,1265028034000,14.5),(VLCCF,1262678148000,13.76),(VLCCF,1262607761000,13.82),(VLCCF,1233832497000,16.9),(VLCCF,1233740569000,16.96)...,(VLCCF,884004754000,23.99),(VLCCF,883720431000,23.57)})

Using the value in $2, I ultimately need to filter out all but one entry, but for now I'm just trying to remove the bag tuples whose timestamps are not below w_time.

flake = FOREACH jnd {
    fits = FILTER jnd BY (w_time > time);
    GENERATE ip, user, w_time, url, fits;
    }

The above does not work. Removing all bag tuples whose timestamp fails the comparison against the desired time (w_time) is step 1; the problem is that w_time is not part of the bag. Does this really require a UDF, or am I missing something simple? I'm stuck.
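To make the intended per-record logic concrete, here is a minimal Python sketch (not Pig) of what the nested filter is supposed to do, using field names and a few tuples from the jnd sample above: keep only the bag tuples whose timestamp is below the outer w_time.

```python
# One joined record: outer fields plus a bag (list of tuples) of stock quotes.
record = {
    "ip": "14.192.253.226",
    "user": "voraciouszing",
    "w_time": 1213201721000,
    "url": '"GET /VLCCF.html HTTP/1.0"',
    "sort": [
        ("VLCCF", 1265361975000, 13.84),   # after the web hit -> drop
        ("VLCCF", 884004754000, 23.99),    # before the web hit -> keep
        ("VLCCF", 883720431000, 23.57),    # before the web hit -> keep
    ],
}

# Step 1: drop bag tuples whose timestamp is not below the outer w_time.
fits = [(sym, t, price) for (sym, t, price) in record["sort"]
        if t < record["w_time"]]
```

The difficulty in Pig is expressing exactly this: the filter runs over the inner bag while the comparison value (w_time) lives on the outer tuple.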

Development environment:

Apache Pig version 0.15.0.2.4.0.0-169 (rexported), compiled Feb 10 2016, 07:50:04. Hadoop 2.7.1.2.4.0.0-169, Subversion git@github.com:hortonworks/hadoop.git -r 26104d8ac833884c8776473823007f17. 4-node Hortonworks cluster.

Any input is appreciated.

1 answer:

Answer 0 (score: 0)

I think that in your foreach you need to filter stock_sort::sort, not jnd, and the filtering should be done by jnd.w_time > time. I managed to write the whole flow without a UDF; see below.

I took two files:

xact.txt:

VLCCF,1265361975000,13.84
VLCCF,1265262560000,14.16
VLCCF,1265192740000,14.44
VLCCF,1265099390000,14.48
VLCCF,1265028034000,14.5
VLCCF,1262678148000,13.76
VLCCF,1262607761000,13.82
VLCCF,1233832497000,16.9
VLCCF,1233740569000,16.96
VLCCF,884004754000,23.99
VLCCF,883720431000,23.5

stock.txt:

14.192.253.226,voraciouszing,1213201721000,"GET /VLCCF.html HTTP/1.0",VLCCF

stock = load 'stock.txt' using PigStorage(',') as (
ip:chararray,
user:chararray,
w_time:long,
url:chararray,
symbol:chararray
);

xact = load 'xact.txt' using PigStorage(',') as (
symbol:chararray,
time:long,
price:double
);

xact_grouped = foreach(group xact by symbol) generate
    group, xact;

joined = join stock by symbol, xact_grouped by group;

filtered = foreach joined {
    grp = filter xact by time < joined.w_time;
    generate ip, grp;
};

dump filtered;

which gives me

(14.192.253.226,{(VLCCF,884004754000,23.99),(VLCCF,883720431000,23.5)})
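As a sanity check, the same group-join-filter flow can be mimicked in plain Python (a sketch, not Pig, using a subset of the sample data): build per-symbol bags as `group xact by symbol` does, join each stock record to its bag, then filter the bag against that record's w_time.

```python
stock = [("14.192.253.226", "voraciouszing", 1213201721000,
          '"GET /VLCCF.html HTTP/1.0"', "VLCCF")]
xact = [("VLCCF", 1265361975000, 13.84),
        ("VLCCF", 1265262560000, 14.16),
        ("VLCCF", 884004754000, 23.99),
        ("VLCCF", 883720431000, 23.5)]

# group xact by symbol -> {symbol: bag of quote tuples}
bags = {}
for row in xact:
    bags.setdefault(row[0], []).append(row)

# join stock by symbol, then filter each record's bag by its own w_time
filtered = [(ip, [q for q in bags.get(sym, []) if q[1] < w_time])
            for (ip, user, w_time, url, sym) in stock]
```

This mirrors why the nested foreach works: after the join, each output tuple carries both its own w_time and the whole bag, so the inner filter can compare every bag tuple against the outer value.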

Edit: alternatively,

stock = load 'stock.txt' using PigStorage(',') as (
ip:chararray,
user:chararray,
w_time:long,
url:chararray,
symbol:chararray
);

xact = load 'xact.txt' using PigStorage(',') as (
symbol:chararray,
time:long,
price:double
);

joined = join stock by symbol, xact by symbol;

joined_filtered = foreach (filter joined by time < w_time) generate
    ip as ip,
    user as user,
    w_time as w_time,
    stock::symbol as symbol,
    time as time,
    price as price;

grouped = foreach (group joined_filtered by (ip, user, w_time)) generate
    flatten(group),
    joined_filtered;
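The second script avoids bags until the very end: join flat, filter flat, then regroup. A Python sketch of that pipeline (again a simulation, not Pig, with a subset of the sample data and hypothetical variable names):

```python
stock = [("14.192.253.226", "voraciouszing", 1213201721000,
          '"GET /VLCCF.html HTTP/1.0"', "VLCCF")]
xact = [("VLCCF", 1265361975000, 13.84),
        ("VLCCF", 884004754000, 23.99),
        ("VLCCF", 883720431000, 23.57)]

# join stock by symbol, xact by symbol (flat rows, no bags yet)
joined = [s + x for s in stock for x in xact if s[4] == x[0]]

# filter joined by time < w_time (time is field 6, w_time is field 2)
joined_filtered = [row for row in joined if row[6] < row[2]]

# group by (ip, user, w_time), rebuilding the bag of surviving quotes
grouped = {}
for ip, user, w_time, url, sym, _sym, t, price in joined_filtered:
    grouped.setdefault((ip, user, w_time), []).append((sym, t, price))
```

Because the filter runs on flat rows, no nested foreach is needed; the trade-off is that the join materializes one row per (record, quote) pair before the regroup.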