Apache Pig: Handling long runs of spaces

Date: 2015-11-06 12:32:54

Tags: apache-pig

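I am learning Apache Pig and trying to process the dataset shown below. It contains movie names and their genres, separated by long runs of spaces or multiple tabs.

Sample dataset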

"!Next?" (1994)                     Documentary
"#1 Single" (2006)                  Reality-TV
"#1MinuteNightmare" (2014)              Horror
"#30Nods" (2014)                    Drama
"#7DaysLater" (2013)                    Comedy
"#ATown" (2014)                     Comedy
"#Actress" (2015)                   Comedy
"#Adulthood" (????)                 Comedy
"#Adulting" (2015)                  Comedy
"#AwkwardMornings" (2014)               Comedy
"#Bandcamp" (2014)                  Musical
"#Besties" (2014)                   Comedy

When I try to load the dataset, only the movie name part is loaded, as shown below.

Load command

grunt> X = LOAD '/home/padhu/Downloads/smallgenre.txt' AS (line:chararray);

Output:

("!Next?" (1994))
("#1 Single" (2006))
("#1MinuteNightmare" (2014))
("#30Nods" (2014))
("#7DaysLater" (2013))

I also tried the following, but got the same output as above:

X = LOAD '/home/padhu/Downloads/smallgenre.txt' AS (line:chararray,line2:Chararray);

I then tried the following and got partial output: when the movie name itself contains a space, I get only the movie name.

X = LOAD '/home/padhu/Downloads/smallgenre.txt' USING PigStorage(' ') AS (line:chararray,line2:Chararray);

I would like the output to look like this:

(MovieName,genre)

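Is there a way to write a Pig statement that ignores the spaces inside the movie name and treats the whitespace between the movie name and the genre as the delimiter?

I searched Google and SO before posting this, but found nothing that helped.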

1 Answer:

Answer 0: (score: 0)

This script is not elegant, but it works for me:

 A = LOAD '/home/padhu/Downloads/smallgenre.txt' USING TextLoader() as (line:chararray);
 B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '      ',2)) AS (MovieName:CHARARRAY, Type:CHARARRAY);
 C = FOREACH B GENERATE MovieName, LTRIM(Type) AS Genre;
 DUMP C;
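
If the separator is sometimes tabs and sometimes a long run of spaces, a regex delimiter may be more robust than a fixed string. The following is only a sketch under some assumptions: titles never contain two consecutive whitespace characters or a tab, and every line has a genre. The pattern '\\s{2,}|\\t' is my own choice and is not something the question specifies.

 -- Read each line whole so that no delimiter is applied at load time.
 A = LOAD '/home/padhu/Downloads/smallgenre.txt' USING TextLoader() AS (line:chararray);
 -- Split once, at the first run of two or more whitespace characters or at a single tab.
 -- Assumption: single spaces inside a title never form such a run.
 B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s{2,}|\\t', 2)) AS (MovieName:chararray, Genre:chararray);
 -- Trim any whitespace left around either field.
 C = FOREACH B GENERATE TRIM(MovieName) AS MovieName, TRIM(Genre) AS Genre;
 DUMP C;

Because the limit argument is 2, only the first matching run is used as the split point, so single spaces such as the one in "#1 Single" (2006) stay inside the movie name. Lines with no separator at all are not handled by this sketch and would need to be filtered out first.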