Pig Latin (filtering a second data source inside a FOREACH loop)

Time: 2013-06-29 00:01:39

Tags: hadoop apache-pig

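I have 2 data sources. One contains a list of API calls and the other contains all related authentication events. Each API call can have multiple auth events, and I want to find the auth event that:
a) contains the same "identifier" as the API call, b) occurred within one second after the API call, and c) is the closest to the API call after the filtering above.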

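I had planned to loop through each API call event in a foreach loop and then use filter statements on authEvents to find the correct one. However, this does not appear to be possible (see "USING Filter in a Nested FOREACH in PIG").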

Can anyone suggest other ways to achieve this? If it helps, here is the Pig script I tried to use:

apiRequests = LOAD '/Documents/ApiRequests.txt' AS (api_fileName:chararray, api_requestTime:long, api_timeFromLog:chararray, api_call:chararray, api_leadString:chararray, api_xmlPayload:chararray, api_sourceIp:chararray, api_username:chararray, api_identifier:chararray);
authEvents = LOAD '/Documents/AuthEvents.txt' AS (auth_fileName:chararray, auth_requestTime:long, auth_timeFromLog:chararray, auth_call:chararray, auth_leadString:chararray, auth_xmlPayload:chararray, auth_sourceIp:chararray, auth_username:chararray, auth_identifier:chararray);
specificApiCall = FILTER apiRequests BY api_call == 'CSGetUser';                 -- Get all events for this specific call
match = foreach specificApiCall {                                                -- Now try to get the closest matching auth event
        filtered1 = filter authEvents by auth_identifier == api_identifier;      -- Only use auth events that have the same identifier (this will return several)
        filtered2 = filter filtered1 by (auth_requestTime-api_requestTime)<1000; -- Further refine by using auth events within a second of the API call's time
        sorted = order filtered2 by auth_requestTime;                            -- Get the auth event that's closest to the API call
        limited = limit sorted 1;
        generate limited;
        };
dump match;

1 Answer:

Answer 0: (score: 1)

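A nested FOREACH is not meant for working with a second relation while looping over the first. It is for the case where your relation contains a bag and you want to work with that bag as if it were its own relation. You cannot work with apiRequests and authEvents at the same time unless you first do some kind of join or grouping that brings all the information you need into a single relation.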
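To illustrate what a nested FOREACH is designed for, here is a minimal sketch that groups authEvents so that each row carries a bag, then works on that bag inside the FOREACH. The aliases auth_by_id, earliest_per_id, by_time and earliest are hypothetical names introduced only for this example.

```pig
-- Assumes authEvents has been loaded as in the question.
-- After GROUP, each row is (group, authEvents: {bag of auth rows}).
auth_by_id = GROUP authEvents BY auth_identifier;

earliest_per_id = FOREACH auth_by_id {
    -- Inside the braces, the bag authEvents can be ordered and limited
    -- as if it were a relation of its own.
    by_time  = ORDER authEvents BY auth_requestTime;
    earliest = LIMIT by_time 1;
    GENERATE group AS auth_identifier, FLATTEN(earliest.auth_requestTime) AS earliest_time;
};
```

The nested block only ever sees fields of the current row, which is why a reference to a different relation is rejected there.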

If you did not need to restrict yourself to a single auth event, your task would map very naturally onto a JOIN followed by a FILTER:

allPairs = JOIN specificApiCall BY api_identifier, authEvents BY auth_identifier;
match = FILTER allPairs BY (auth_requestTime-api_requestTime)<1000;

Now all the information is together, and you can do GROUP match BY api_identifier followed by a nested FOREACH to pick out a single event.
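The grouping step described above could be sketched as follows. This is a hedged example rather than a tested script: the aliases in_window, per_call, by_time, nearest and closest are hypothetical, the lower bound (>= 0) is an added assumption reflecting the question's requirement that the auth event occur after the API call, and grouping on api_requestTime in addition to api_identifier is an assumption intended to keep apart two API calls that share an identifier.

```pig
-- Assumes allPairs is the result of the JOIN shown above.
-- Keep only auth events from the API call's time up to one second later.
in_window = FILTER allPairs BY (auth_requestTime - api_requestTime) >= 0
                           AND (auth_requestTime - api_requestTime) < 1000;

-- One group per API call (the composite key is an assumption, see the lead-in).
per_call = GROUP in_window BY (api_identifier, api_requestTime);

closest = FOREACH per_call {
    -- Every surviving auth event is at or after the call, so the
    -- earliest one is also the closest one.
    by_time = ORDER in_window BY auth_requestTime;
    nearest = LIMIT by_time 1;
    GENERATE FLATTEN(nearest);
};

DUMP closest;
```

Because the time window is applied before the LIMIT, an auth event outside the window can never displace a valid one. If Pig reports an ambiguous field name after the JOIN, the fully qualified form (for example authEvents::auth_requestTime) can be used instead.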

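However, you can do this in a single step if you use the COGROUP operator, which is like JOIN but without the cross product: for each key you get two bags, one holding the grouped records from each relation. Use it to pick out the nearest auth event: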

cogrp = COGROUP specificApiCall BY api_identifier, authEvents BY auth_identifier;
singleAuth = FOREACH cogrp {
    auth_sorted = ORDER authEvents BY auth_requestTime;
    auth_1 = LIMIT auth_sorted 1;
    GENERATE FLATTEN(specificApiCall), FLATTEN(auth_1);
    };

Then FILTER to keep only those within 1 second:

match = FILTER singleAuth BY (auth_requestTime-api_requestTime)<1000;