我最近在工作中遇到了这个问题,这是关于猪的扁平化。我用一个简单的例子来表达它
两个文件
===文件1 ===
1_A
2_b
4_d
=== file2(tab seperated)===
1 a
2 b
3 c
猪脚本1:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
c = join a1 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
猪脚本2:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2)) as (num:int, ch:chararray);
a2 = foreach a1 generate (int)num as num, ch as ch;
c = join a2 by num, b by num;
dump c; -- exception java.lang.String cannot be cast to java.lang.Integer
猪脚本3:
a = load 'file1' as (str:chararray);
b = load 'file2' as (num:int, ch:chararray);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
a2 = foreach a1 generate (int)$0 as num, $1 as ch;
c = join a2 by num, b by num;
dump c; -- right
我不知道为什么脚本1,2是错的,脚本3是对的,我也想知道是否有更简洁的表达式来获得关系c,thx。
答案 0 :(得分:4)
您是否有任何特殊原因未使用PigStorage?因为它可以让你的生活变得如此简单:)。
a = load '/file1' USING PigStorage('_') AS (num:int, char:chararray);
b = load '/file2' USING PigStorage('\t') AS (num:int, char:chararray);
c = join a by num, b by num;
dump c;
另请注意,在file1中,您使用下划线作为分隔符,但是您将“ - ”作为STRSPLIT的参数。
修改强> 我花了一些时间在你提供的脚本上;脚本1& 2确实不起作用,脚本3也像这样工作(没有额外的foreach):
a = load 'file1' as (str:chararry);
b = load 'file2' as (num:int, ch:chararry);
a1 = foreach a generate flatten(STRSPLIT(str,'_',2));
c = join a1 by (int)($0), b by num;
dump c;
至于问题的根源,我会猜测并说它可能与此相关(as stated in Pig Documentation)并结合了猪的运行周期优化:
如果FLATTEN包含空内部模式的包,则结果关系的模式为空。
在您的情况下,我相信STRSPLIT结果的模式在运行时之前是未知的。
<强> EDIT2:强> 好的,这是我的理论解释:
This is the complete -explain- output for script 2和this is for script 3。我将在这里粘贴有趣的部分。
|---a2: (Name: LOForEach Schema: num#288:int,ch#289:chararray)
| | |
| | (Name: LOGenerate[false,false] Schema: num#288:int,ch#289:chararray)ColumnPrune:InputUids=[288, 289]ColumnPrune:OutputUids=[288, 289]
| | | |
| | | (Name: Cast Type: int Uid: 288)
| | | |
| | | |---num:(Name: Project Type: int Uid: 288 Input: 0 Column: (*))
上面的部分是针对脚本2;看到最后一行。它假设flatten(STRSPLIT)
的输出将具有类型integer
的第一个元素(因为您以这种方式提供了模式)。但实际上STRSPLIT
有一个null
输出模式,被视为bytearray
个字段;所以flatten(STRSPLIT)
的输出实际上是(n:bytearray, c:bytearray)
。因为你提供了一个模式,所以pig会尝试将java转换(到a1
的输出)到num
字段;失败为num
实际上是一个表示为bytearray的java String
。由于这个java-cast失败了,pig甚至没有尝试在上面的行中进行显式转换。
让我们看看脚本3的情况:
|---a2: (Name: LOForEach Schema: num#85:int,ch#87:bytearray)
| | |
| | (Name: LOGenerate[false,false] Schema: num#85:int,ch#87:bytearray)ColumnPrune:InputUids=[]ColumnPrune:OutputUids=[85, 87]
| | | |
| | | (Name: Cast Type: int Uid: 85)
| | | |
| | | |---(Name: Project Type: bytearray Uid: 85 Input: 0 Column: (*))
请参见最后一行,此处a1
的输出已正确视为bytearray
,此处没有任何问题。现在看看倒数第二行; pig尝试(并成功)从bytearray
到integer
进行显式的强制转换操作。