I have a problem with a join in Pig. Let me first give you the background. Here is my code:
-- START file loading
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray, COD_IPUSER:chararray);
-- trim
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER;
dump A;
This gives the following output:
(79.92.147.88,20140310)
(79.92.147.88,20140310)
(109.31.67.3,20140310)
(109.31.67.3,20140310)
(109.7.229.143,20140310)
(109.8.114.133,20140310)
(77.198.79.99,20140310)
(77.200.174.171,20140310)
(77.200.174.171,20140310)
(109.17.117.212,20140310)
Loading the other file:
-- Load the Hadopi search file
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray);
dump file2;
The output is this:
(2014/03/10 00:00:00,79.92.147.88)
(2014/03/10 00:00:01,79.92.147.88)
(2014/03/10 00:00:00,192.168.2.67)
Now I want to do a left outer join. Here is the code:
result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER;
dump result;
The output is this:
(2014/03/10 00:00:00,79.92.147.88,,)
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:01,79.92.147.88,,)
" file2"的所有记录在这里,这很好,但任何start_file都在这里。就像连接失败一样。
你知道问题出在哪里吗?
感谢。
Answer 0 (score: 2)
You have mislabeled the fields in file2. You call the first field the IP and the second field the date, when, as the dump shows, it is exactly the other way around. Try FOREACH file2 GENERATE IP_RECHERCHEE and you will see which field you are actually trying to join on.
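For reference, a minimal sketch of the check this answer describes (the alias check is made up here); it should print dates rather than IPs, confirming the labels are swapped:

-- Hypothetical verification: dump only the field used in the join
check = FOREACH file2 GENERATE IP_RECHERCHEE;
dump check;
-- Expected output (timestamps, not IP addresses):
-- (2014/03/10 00:00:00)
-- (2014/03/10 00:00:01)
-- (2014/03/10 00:00:00)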
Answer 1 (score: 1)
The result is as expected. You are doing a left outer join, which looks for matches between the IP_RECHERCHEE field of file2 and COD_IPUSER of A.
Since there are no matches, it returns every record of file2 and fills the fields that would have come from A with nulls.
Obviously 2014/03/10 00:00:00 != 20140310
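If you really did want to join on the date, the two formats would first have to be normalized. A minimal sketch using Pig's built-in SUBSTRING and REPLACE (the alias file2_dates and the field DATE_NORM are hypothetical names):

-- Hypothetical: strip the time part and the slashes so '2014/03/10 00:00:00' becomes '20140310'
file2_dates = FOREACH file2 GENERATE
    REPLACE(SUBSTRING(IP_RECHERCHEE, 0, 10), '/', '') AS DATE_NORM,  -- IP_RECHERCHEE actually holds the timestamp
    DATE_HADO;                                                        -- DATE_HADO actually holds the IP
-- DATE_NORM can now be compared with COD_IPUSER from A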
Answer 2 (score: 1)
Your field names are wrong, and you are joining on the wrong fields. It looks like you want to join by IP address.
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray);
A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE;
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray);
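The JOIN statement itself is not shown above; presumably it is the same left outer join as in the question, just on the renamed IP fields, something like:

-- Assumed join (not shown in the original answer): left outer join on the IP columns
result = JOIN file2 BY IP LEFT OUTER, A BY IP;
dump result;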
And what I get is this:
(2014/03/10 00:00:00,192.168.2.67,,)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)