I have an externally stored Hive table with a very simple data structure. The table was created in Hive as
(user string, names array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001'
STORED AS TEXTFILE
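(I have also tried other delimiters.)
In Pig, I can't seem to find the right way to load this simple array using a bag or a tuple! Here are my unsuccessful attempts: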
users = load '<file>' using PigStorage() AS (user:chararray, names:bag{tuple(name:chararray)})
users = load '<file>' using PigStorage() AS (user:chararray, names:chararray)
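and a few other things, but the best I have managed is to load the names as a single string with the delimiters stripped out (which doesn't help). How do I load just a variable-length array of strings?
Thanks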
Answer 0 (score: 1)
Let's say you have the following tab-separated data in the file /user/hdfs/tester/ip/test on HDFS:
cat test:
1 A,B
2 C,D,E,F
3 G
4 H,I,J,K,L,M
In Pig (MapReduce mode), run the following:
a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') as (id:INT,names:chararray);
b = FOREACH a GENERATE id, FLATTEN(TOBAG(STRSPLIT(names,','))) as value:tuple(name:CHARARRAY);
The first column is id, and value holds the chararray names obtained by splitting the second column on commas.
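As a rough illustration of what the two Pig statements are meant to produce, here is a small Python sketch (not Pig, and not part of the original answer) that splits each row the same way. The function name parse_line and its parameters are made up for this example, and the sample rows assume the two fields in the test file are separated by a tab.

```python
def parse_line(line, field_sep="\t", item_sep=","):
    """Split one text row into (id, [names]).

    field_sep separates the two columns (like PigStorage('\\t'));
    item_sep separates the entries inside the names column.
    Both parameter names are hypothetical, chosen for this sketch.
    """
    raw_id, names = line.rstrip("\n").split(field_sep, 1)
    return int(raw_id), names.split(item_sep)


# Sample rows mirroring the test file above (tab between the columns is assumed).
sample = ["1\tA,B", "2\tC,D,E,F", "3\tG", "4\tH,I,J,K,L,M"]
rows = [parse_line(line) for line in sample]

for row in rows:
    print(row)
```

Calling parse_line with item_sep="\x01" would mimic the '\001' collection delimiter from the Hive table definition in the question, if the file really contains that character between names.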