I'm getting up to speed with Pig, and I'm combining web_log data and stock pricing history from two sources. The dates/times have been normalized to timestamps, and a join is performed on the stock symbol. The timestamps do not match.
jnd = JOIN web_time BY w_sym, stock_sort BY group;
The group contains a bag of stock data specific to that symbol. Here is the combined schema:
jnd: {web_time::ip: chararray, web_time::user: chararray, web_time::w_time: long, web_time::url: chararray, stock_sort::sort: {(sym: chararray, time: long, price: double)}}
I need to filter the stock_sort bag by comparing web_time::w_time with time, which is not an exact match. Sample jnd data looks like this:
(14.192.253.226,voraciouszing,1213201721000,"GET /VLCCF.html HTTP/1.0",{(VLCCF,1265361975000,13.84),(VLCCF,1265262560000,14.16),(VLCCF,1265192740000,14.44),(VLCCF,1265099390000,14.48),(VLCCF,1265028034000,14.5),(VLCCF,1262678148000,13.76),(VLCCF,1262607761000,13.82),(VLCCF,1233832497000,16.9),(VLCCF,1233740569000,16.96)...,(VLCCF,884004754000,23.99),(VLCCF,883720431000,23.57)})
Using the value in $2, I eventually need to filter out all but one entry, but for now I'm just trying to remove the tuples with the smaller timestamps.
flake = FOREACH jnd {
    fits = FILTER jnd BY (w_time > time);
    GENERATE ip, user, w_time, url, fits;
}
The above does not work. Removing all bag tuples whose timestamp is smaller than the desired time (w_time) is step 1; w_time is not part of the group. Does this really require a UDF, or am I missing something simple? I'm at a standstill.
Apache Pig version 0.15.0.2.4.0.0-169 (rexported), compiled Feb 10 2016, 07:50:04; Hadoop 2.7.1.2.4.0.0-169, Subversion git@github.com:hortonworks/hadoop.git -r 26104d8ac833884c8776473823007f17; 4-node Hortonworks cluster.
Any input is appreciated.
Answer 0 (score: 0)
I think that in your FOREACH you need to filter stock_sort::sort, not jnd, and the filtering should be done by jnd.w_time > time. I managed to write the whole flow without a UDF; see below.
I took two files:
xact.txt:
VLCCF,1265361975000,13.84
VLCCF,1265262560000,14.16
VLCCF,1265192740000,14.44
VLCCF,1265099390000,14.48
VLCCF,1265028034000,14.5
VLCCF,1262678148000,13.76
VLCCF,1262607761000,13.82
VLCCF,1233832497000,16.9
VLCCF,1233740569000,16.96
VLCCF,884004754000,23.99
VLCCF,883720431000,23.5
stock.txt:
14.192.253.226,voraciouszing,1213201721000,"GET /VLCCF.html HTTP/1.0",VLCCF
stock = load 'stock.txt' using PigStorage(',') as (
ip:chararray,
user:chararray,
w_time:long,
url:chararray,
symbol:chararray
);
xact = load 'xact.txt' using PigStorage(',') as (
symbol:chararray,
time:long,
price:double
);
xact_grouped = foreach(group xact by symbol) generate
group, xact;
joined = join stock by symbol, xact_grouped by group;
filtered = foreach joined {
    grp = filter xact by time < joined.w_time;
    generate ip, grp;
};
dump filtered;
This gives me:
(14.192.253.226,{(VLCCF,884004754000,23.99),(VLCCF,883720431000,23.5)})
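Since the question's end goal is to keep only one entry rather than all earlier ones, the nested FOREACH above can be extended with ORDER and LIMIT (both are supported as nested operators in this Pig version). This is a sketch using the same relation names as the script above, mirroring the answer's filter expression; it has not been run against the original data:

```pig
-- Sketch: keep only the single most recent transaction that
-- precedes w_time (step 2 of the question). 'joined' and 'xact'
-- come from the script above.
closest = foreach joined {
    earlier = filter xact by time < joined.w_time;
    sorted  = order earlier by time desc;  -- newest first
    latest  = limit sorted 1;              -- keep one tuple
    generate ip, user, w_time, url, flatten(latest);
};
dump closest;
```

FLATTEN turns the one-tuple bag back into plain fields, so each web hit yields a single flat row.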
Edit: alternatively,
stock = load 'stock.txt' using PigStorage(',') as (
ip:chararray,
user:chararray,
w_time:long,
url:chararray,
symbol:chararray
);
xact = load 'xact.txt' using PigStorage(',') as (
symbol:chararray,
time:long,
price:double
);
joined = join stock by symbol, xact by symbol;
joined_filtered = foreach (filter joined by time < w_time) generate
    ip as ip,
    user as user,
    w_time as w_time,
    stock::symbol as symbol,
    time as time,
    price as price;
grouped = foreach (group joined_filtered by (ip, user, w_time)) generate
    flatten(group),
    joined_filtered;
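The second variant stops at re-grouping and does not show the "keep only one" step or a DUMP. Assuming the joined_filtered relation above, the same nested ORDER/LIMIT idea would finish it; a sketch, with the 'nearest' relation name being my own addition:

```pig
-- Sketch: per (ip, user, w_time) key, keep only the most recent
-- matching transaction, then flatten it back to a plain row.
nearest = foreach (group joined_filtered by (ip, user, w_time)) {
    sorted = order joined_filtered by time desc;  -- newest first
    latest = limit sorted 1;                      -- keep one tuple
    generate flatten(group), flatten(latest.(symbol, time, price));
};
dump nearest;
```

This trades the bag-valued output of the first variant for one flat row per web hit.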