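I am learning Apache Pig and am trying to process a dataset like the one below. It contains movie names and their genres, separated by a long run of spaces or by multiple tabs.
Dataset sample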
"!Next?" (1994) Documentary
"#1 Single" (2006) Reality-TV
"#1MinuteNightmare" (2014) Horror
"#30Nods" (2014) Drama
"#7DaysLater" (2013) Comedy
"#ATown" (2014) Comedy
"#Actress" (2015) Comedy
"#Adulthood" (????) Comedy
"#Adulting" (2015) Comedy
"#AwkwardMornings" (2014) Comedy
"#Bandcamp" (2014) Musical
"#Besties" (2014) Comedy
When I try to load the dataset, only the movie-name part is loaded, as shown below.
Load command
grunt> X = LOAD '/home/padhu/Downloads/smallgenre.txt' AS (line:chararray);
Output:
("!Next?" (1994))
("#1 Single" (2006))
("#1MinuteNightmare" (2014))
("#30Nods" (2014))
("#7DaysLater" (2013))
I also tried the following, but got the same output as above.
X = LOAD '/home/padhu/Downloads/smallgenre.txt' AS (line:chararray,line2:Chararray);
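I then tried the following and got partial output: if the movie name itself contains a space, I get only the movie name and not the genre.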
X = LOAD '/home/padhu/Downloads/smallgenre.txt' USING PigStorage(' ') AS (line:chararray,line2:Chararray);
I would like the output to look like this:
(MovieName,genre)
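Is there a way to write a Pig statement that ignores the spaces inside the movie name and treats the whitespace between the movie name and the genre as the delimiter?
Before posting this I searched Google and SO, but found nothing that helped.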
Answer 0: (score: 0)
This script is not elegant, but it works for me:
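-- Read each line as a single chararray so no field is dropped at load time.
A = LOAD '/home/padhu/Downloads/smallgenre.txt' USING TextLoader() AS (line:chararray);
-- Split into at most two parts. The second STRSPLIT argument is a regex and must match the
-- real separator in the file; a single space also splits titles that contain spaces.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ' ', 2)) AS (MovieName:chararray, Type:chararray);
-- Strip any leftover leading whitespace from the genre.
C = FOREACH B GENERATE MovieName, LTRIM(Type) AS Genre;
DUMP C;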
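If the separator is not known exactly (spaces, tabs, or a mix), one possible alternative is to anchor on the "(year)" token instead of on the delimiter. The sketch below uses the built-in REGEX_EXTRACT_ALL and the MATCHES operator. The pattern itself is an assumption based only on the sample lines in the question: every line is taken to be a title, then a four-character parenthesised year such as (2014) or (????), then whitespace, then the genre. Lines that do not fit that shape are filtered out rather than parsed.

```pig
-- Read each line whole; splitting happens afterwards.
A = LOAD '/home/padhu/Downloads/smallgenre.txt' USING TextLoader() AS (line:chararray);

-- Keep only lines shaped like:  <title> (<4 chars>) <whitespace> <genre>
-- (assumed shape, taken from the sample data in the question)
B = FILTER A BY line MATCHES '.*\\(\\S{4}\\)\\s+\\S.*';

-- Group 1: everything up to and including the "(year)" token.
-- Group 2: the genre, without surrounding whitespace.
C = FOREACH B GENERATE
        FLATTEN(REGEX_EXTRACT_ALL(line, '(.*\\(\\S{4}\\))\\s+(\\S.*?)\\s*'))
        AS (MovieName:chararray, Genre:chararray);

DUMP C;
```

For a line such as "#1 Single" (2006) followed by Reality-TV, group 1 would be "#1 Single" (2006) and group 2 would be Reality-TV, regardless of whether the gap between them is one space, several spaces, or tabs. If the real file has year tokens in other forms, the \\S{4} part would need adjusting.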