Join error using Pig

Time: 2014-10-11 06:42:55

Tags: join apache-pig

I used the following commands:

X1 = LOAD '/PIG10/' using PigStorage(',') as (statename:chararray,district:chararray,code:chararray,ru:chararray);
Y1 = LOAD '/POP2/' using PigStorage(',') as (district:chararray,r_u:chararray);

X1 should have four columns of data; dumping it gives:

(JAMMU & KASHMIR    JAMMU & KASHMIR 00000   Total,,,)
(JAMMU & KASHMIR    JAMMU & KASHMIR 00000   Rural,,,)
(JAMMU & KASHMIR    JAMMU & KASHMIR 00000   Urban,,,)
(JAMMU & KASHMIR    Kupwara 00000   Total,,,)
(JAMMU & KASHMIR    Kupwara 00000   Rural,,,)
(JAMMU & KASHMIR    Kupwara 00000   Urban,,,)
(JAMMU & KASHMIR    Badgam  00000   Total,,,)
(JAMMU & KASHMIR    Badgam  00000   Rural,,,)
(JAMMU & KASHMIR    Badgam  00000   Urban,,,)
(JAMMU & KASHMIR    Leh(Ladakh) 00000   Total,,,)
(JAMMU & KASHMIR    Leh(Ladakh) 00000   Rural,,,)
(JAMMU & KASHMIR    Leh(Ladakh) 00000   Urban,,,)
(JAMMU & KASHMIR    Kargil  00000   Total,,,)
(JAMMU & KASHMIR    Kargil  00000   Rural,,,)
(JAMMU & KASHMIR    Kargil  00000   Urban,,,)
(JAMMU & KASHMIR    Punch   00000   Total,,,)
(JAMMU & KASHMIR    Punch   00000   Rural,,,)

Dumping Y1 gives:

(JAMMU & KASHMIR    Total,)
(JAMMU & KASHMIR    Rural,)
(JAMMU & KASHMIR    Urban,)
(Kupwara    Total,)
(Kupwara    Rural,)
(Kupwara    Urban,)
(Badgam Total,)
(Badgam Rural,)
(Badgam Urban,)
(Leh(Ladakh)    Total,)
(Leh(Ladakh)    Rural,)
(Leh(Ladakh)    Urban,)
(Kargil Total,)
(Kargil Rural,)
(Kargil Urban,)
(Punch  Total,)
(Punch  Rural,)
(Punch  Urban,)
(Rajouri    Total,)
(Rajouri    Rural,)
(Rajouri    Urban,)

I ran the join C2 = join X1 by district,Y1 by district; but I cannot get any output.

1 answer:

Answer 0 (score: 1)

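The reason is that each input line is loaded entirely into the first column, while the remaining three columns of X1 (district, code, ru) and the one remaining column of Y1 (r_u) are null. The join key district is therefore null in both relations, and an inner join never matches null keys, so C2 comes out empty. It looks like the ',' delimiter does not fit your input data. Can you paste the actual input format of the files PIG10 and POP2?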
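
The effect can be reproduced outside Pig. Below is a minimal Python sketch, not Pig's actual implementation: the helper split_like_pigstorage is hypothetical, and the sample line assumes the fields are separated by runs of spaces, as they appear in the dump above. It mimics how a declared schema pads missing fields with null when the delimiter never occurs in the line, and shows that splitting on a single space does not help either, because the values themselves contain spaces.

```python
def split_like_pigstorage(line, delimiter, num_fields):
    """Split a line on delimiter and fit it to a schema of num_fields columns.

    Hypothetical helper: missing trailing fields become None (Pig's null),
    extra fields are dropped.
    """
    parts = line.split(delimiter)[:num_fields]
    return tuple(parts + [None] * (num_fields - len(parts)))


# Assumed raw line: values separated by runs of spaces, no commas at all.
sample = "JAMMU & KASHMIR" + " " * 4 + "Kupwara 00000" + " " * 3 + "Total"

# The ',' delimiter never occurs, so the whole line lands in the first column.
with_comma = split_like_pigstorage(sample, ",", 4)

# A single space splits inside "JAMMU & KASHMIR", so the columns are wrong.
with_space = split_like_pigstorage(sample, " ", 4)

print(with_comma)
print(with_space)
```

If the real files turn out to be tab-separated, loading with PigStorage('\t') may be enough on its own; if they are space-separated as assumed here, no single-character delimiter works and a regex is the more practical route.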

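Solution:
Try this script. The regexes below were written against the sample input shown above only, so they may need adjusting for the rest of your data.

    -- TextLoader reads each line as a single chararray, whatever the delimiter is.
    X = LOAD '/PIG10/' USING TextLoader() AS (line:chararray);
    Y = LOAD '/POP2/' USING TextLoader() AS (line1:chararray);
    -- Split each line into named columns with a regex.
    X1 = FOREACH X GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '(\\w+|\\w+\\s+&\\s+\\w+)\\s+([a-zA-Z()]+|\\w+\\s+&\\s+\\w+)\\s+(\\w+)\\s+(\\w+)')) AS (statename:chararray,district:chararray,code:chararray,ru:chararray);
    Y1 = FOREACH Y GENERATE FLATTEN(REGEX_EXTRACT_ALL(line1, '([a-zA-Z()]+|\\w+\\s+&\\s+\\w+)\\s+(\\w+)')) AS (district:chararray,r_u:chararray);
    -- Join the two relations on the district column.
    C2 = join X1 by district,Y1 by district;
    DUMP C2;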
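
Before running the job, the two patterns can be checked quickly outside Pig. This is a rough Python sketch under a few assumptions: the sample lines are separated by runs of spaces, the extract helper is hypothetical, and re.fullmatch is used because, as far as I know, REGEX_EXTRACT_ALL only returns groups when the whole string matches. Python's re and Java's regex engine agree on the constructs used here, but that should not be taken for granted with more complex patterns.

```python
import re

# Same patterns as in the Pig script, with single backslashes in raw strings.
X_PATTERN = re.compile(
    r"(\w+|\w+\s+&\s+\w+)\s+([a-zA-Z()]+|\w+\s+&\s+\w+)\s+(\w+)\s+(\w+)"
)
Y_PATTERN = re.compile(r"([a-zA-Z()]+|\w+\s+&\s+\w+)\s+(\w+)")


def extract(pattern, line):
    """Return the captured groups if the whole line matches, else None."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


# Assumed sample lines, modelled on the dumps in the question.
x_line = "JAMMU & KASHMIR" + " " * 4 + "Leh(Ladakh) 00000" + " " * 3 + "Urban"
y_line = "JAMMU & KASHMIR" + " " * 4 + "Total"

print(extract(X_PATTERN, x_line))
print(extract(Y_PATTERN, y_line))
```

A line that does not fit the pattern, for example a district name made of several plain words, gives None here and would give null columns in Pig, so it is worth testing a few unusual rows from the real files.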

Sample output of the Pig script above:
(JAMMU & KASHMIR,Punch,00000,Total,Punch,Rural)
(JAMMU & KASHMIR,Punch,00000,Total,Punch,Urban)
(JAMMU & KASHMIR,Badgam,00000,Urban,Badgam,Rural)
(JAMMU & KASHMIR,Badgam,00000,Urban,Badgam,Total)
(JAMMU & KASHMIR,Badgam,00000,Urban,Badgam,Urban)
(JAMMU & KASHMIR,Leh(Ladakh),00000,Urban,Leh(Ladakh),Rural)
(JAMMU & KASHMIR,Leh(Ladakh),00000,Urban,Leh(Ladakh),Total)
(JAMMU & KASHMIR,Leh(Ladakh),00000,Urban,Leh(Ladakh),Urban)
(JAMMU & KASHMIR,JAMMU & KASHMIR,00000,Rural,JAMMU & KASHMIR,Urban)
(JAMMU & KASHMIR,JAMMU & KASHMIR,00000,Rural,JAMMU & KASHMIR,Rural)
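
Note the shape of this output: joining on district alone pairs every X1 row of a district with every Y1 row of the same district, so Total is combined with Rural, Urban and so on. If one row per district and ru value is what you want, joining on both columns, e.g. C2 = join X1 by (district, ru), Y1 by (district, r_u); may be closer to the intent. That is a guess about the goal, not something stated in the question. The difference can be sketched in Python as follows; inner_join is a hypothetical helper and the rows are taken from the sample data.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner join two lists of tuples; rows whose key is None never match."""
    index = defaultdict(list)
    for row in right:
        key = right_key(row)
        if key is not None:
            index[key].append(row)
    joined = []
    for row in left:
        key = left_key(row)
        if key is None:
            continue
        for match in index.get(key, []):
            joined.append(row + match)
    return joined


x1 = [
    ("JAMMU & KASHMIR", "Punch", "00000", "Total"),
    ("JAMMU & KASHMIR", "Punch", "00000", "Rural"),
]
y1 = [("Punch", "Total"), ("Punch", "Rural"), ("Punch", "Urban")]

# Key on district only: every X1 row meets every Y1 row of that district.
by_district = inner_join(x1, y1, lambda r: r[1], lambda r: r[0])

# Key on (district, ru): only rows with the same Total/Rural/Urban value meet.
by_district_and_ru = inner_join(
    x1, y1, lambda r: (r[1], r[3]), lambda r: (r[0], r[1])
)

print(len(by_district))
print(by_district_and_ru)
```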