Reading arrays of strings from a file using Apache Pig

Asked: 2015-08-07 00:32:14

Tags: arrays hadoop hive apache-pig

I have an externally stored Hive table with a very simple data structure. The table is created in Hive as

(user string, names array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\001'
STORED AS TEXTFILE

(I have also tried other delimiters.)

In Pig, I can't seem to find the right way to load this simple array as a bag or tuple. These are the attempts I've made with no luck:

users = load '<file>' using PigStorage() AS (user:chararray, names:bag{tuple(name:chararray)})

users = load '<file>' using PigStorage() AS (user:chararray, names:chararray)

as well as a few other things, but the best I've managed is to load the names as a single string and strip the delimiters (which doesn't help). How can I load a variable-length array of strings?

Thanks

1 Answer:

Answer (score: 1):

Let's say you have the following data in the file /user/hdfs/tester/ip/test on HDFS.

cat test:
1   A,B
2   C,D,E,F
3   G
4   H,I,J,K,L,M

In Pig (MapReduce mode), do the following:

-- Load the tab-separated file; names arrives as a single comma-separated chararray.
a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') AS (id:int, names:chararray);
-- Split names into a tuple, wrap it in a bag, then flatten it back alongside id.
b = FOREACH a GENERATE id, FLATTEN(TOBAG(STRSPLIT(names, ','))) AS value:tuple(name:chararray);

The first column is the id, and value is a tuple of chararray fields. If the file uses a different collection delimiter (such as the '\001' from the Hive DDL), pass that delimiter to STRSPLIT instead of ','.
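
If a bag (rather than a tuple) is what you are after, a minimal alternative sketch using the built-in TOKENIZE is shown below. The relation names and the DUMP line are only illustrative, and the two-argument form of TOKENIZE assumes Pig 0.11 or later:

-- TOKENIZE returns a bag of single-field tuples, which matches the bag{tuple(name:chararray)}
-- shape from the question. Swap ',' for whatever collection delimiter the file really uses.
a = LOAD '/user/hdfs/tester/ip/test' USING PigStorage('\t') AS (id:int, names:chararray);
c = FOREACH a GENERATE id, TOKENIZE(names, ',') AS names_bag;
DUMP c;
-- Row 2 of the sample data should come out roughly as: (2,{(C),(D),(E),(F)})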