Below is a sample from a dataset whose fields are delimited by double colons (::).
1::Toy Story (1995)::Animation|Children's|Comedy
I want to extract three fields from the dataset above: movieID, title, and genre. I wrote the following code for this:
movies = LOAD 'location/of/dataset/on/hdfs'
using PigStorage('::')
as
(MovieID:int,title:chararray,genre:chararray);
But I get the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file script.pig, line 1, column 9> pig script failed to validate:
java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[::]'
Answer 0 (score: 3)
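PigStorage accepts only a single-character delimiter, so it cannot be instantiated with '::'. Use MyRegExLoader instead; it requires piggybank.jar.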
REGISTER '/path/to/piggybank.jar';
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('([^\\:]+)::([^\\:]+)::([^\\:]+)')
as (movieid:int, title:chararray, genre:chararray);
Output:
(1,Toy Story (1995),Animation|Children's|Comedy)
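To see what the three capture groups in that pattern pick up, here is a minimal sketch that applies the same regular expression to the sample line outside of Pig. It is written in Python only so it can be run quickly; the function name `split_record` is hypothetical and not part of Pig or piggybank, and the sketch assumes the loader matches each line against the pattern in roughly the same way.

```python
import re

# Same pattern as in the Pig answer, written as a Python raw string
# (Pig's '\\:' becomes '\:' once the string literal is unescaped).
PATTERN = re.compile(r"([^\:]+)::([^\:]+)::([^\:]+)")


def split_record(line):
    """Return (movie_id, title, genre) for one '::'-delimited line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    movie_id, title, genre = match.groups()
    # Mirror the Pig schema: movieid is an int, the other two are strings.
    return int(movie_id), title, genre


if __name__ == "__main__":
    print(split_record("1::Toy Story (1995)::Animation|Children's|Comedy"))
```

Because each group is `[^\:]+`, a field that itself contains a single colon (for example, a title with a subtitle) will not match this pattern, so it may be worth checking the data for such rows before relying on it.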