How to sum 2 log files in Pig

Asked: 2015-10-20 03:33:26

Tags: hadoop sum apache-pig

I have a problem summing 2 log files.

Sample files:

  1. File-1

    id user view
    1  AAA  2
    2  BBB  5
    3  CCC  9

  2. File-2

    id user view address
    1  AAA  5    XXX
    2  BBB  2    YYY
    6  FFF  4    ZZZ

  3. I want to join the two files by id and sum(view). The output I expect:

    Output:

    id user view address
    1  AAA  7    XXX
    2  BBB  7    YYY
    

    I tried code that joins the two files, but I could not sum the views:

    My code:

    inputdata = LOAD '/user/hdfs/tes/part-1' AS (
        id:chararray,
        user:chararray,
        view:int
    );

    inputdata2 = LOAD '/user/hdfs/tes/part-2' AS (
        id:chararray,
        user:chararray,
        view:int,
        address:chararray
    );

    joined = JOIN inputdata BY id LEFT OUTER, inputdata2 BY id;

    outputlist = FOREACH joined {
            GENERATE
            inputdata::id,
            inputdata::user,
            --sum(inputdata2::view),
            inputdata2::address;
    }

    dump outputlist;
    

    My question: how can I sum the views across the two log files?

    Thanks.

1 Answer:

Answer 0 (score: 2):

Take the join result and add the two view values inside the FOREACH. This works:

A = LOAD 'file1.dat' USING PigStorage(' ') AS (a:chararray, b:chararray, c:int);
B = LOAD 'file2.dat' USING PigStorage(' ') AS (a:chararray, b:chararray, c:int, d:chararray);
C = JOIN A BY a, B BY a;
D = FOREACH C GENERATE A::a AS id, A::b AS user, A::c + B::c AS view, B::d AS address;

Output:

(1,AAA,7,XXX)
(2,BBB,7,YYY)
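
Note that the inner JOIN above only emits ids present in both files (so ids 3 and 6 are dropped), which matches the expected output. If you also wanted the unmatched rows, a sketch using a FULL OUTER JOIN with null-safe addition could look like this (same file names and aliases as above; the null handling via Pig's bincond `?:` operator is my assumption about the desired behavior, not part of the original answer):

```
-- Keep ids present in either file; treat a missing view as 0.
-- Address comes only from file2, so it stays null for file1-only ids.
A = LOAD 'file1.dat' USING PigStorage(' ') AS (a:chararray, b:chararray, c:int);
B = LOAD 'file2.dat' USING PigStorage(' ') AS (a:chararray, b:chararray, c:int, d:chararray);
C = JOIN A BY a FULL OUTER, B BY a;
D = FOREACH C GENERATE
        (A::a is null ? B::a : A::a) AS id,
        (A::b is null ? B::b : A::b) AS user,
        (A::c is null ? 0 : A::c) + (B::c is null ? 0 : B::c) AS view,
        B::d AS address;
```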