Problem loading MovieLens data into Pig

Time: 2015-12-13 05:38:57

Tags: apache-pig hdfs bigdata

I am trying to load some data into Pig.

Records:

11::American President, The (1995)::Comedy|Drama|Romance

12::Dracula: Dead and Loving It (1995)::Comedy|Horror

Script used:

loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat' 
               USING PigStorage(':') 
               AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);

Output:

 11,,American President, The (1995),,Comedy|Drama|Romance
 12,,Dracula,, Dead and Loving It (1995)

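How do I handle the colon (:) after "Dracula"?

Because of that colon, the second column is split into two columns, and since we only have three columns in total, the last column (Comedy|Horror) for Movieid 12 is not loaded.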
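The splitting behavior described above can be reproduced outside Pig. The following is a minimal Python sketch, not Pig code. It assumes that PigStorage(':') splits on every single colon, much as Python's str.split(':') does; the variable names are illustrative only.

```python
# Illustrative only: mimic splitting each record on every single ':'.
records = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]

# Each '::' produces an empty field between its two colons.
fields_11, fields_12 = (r.split(":") for r in records)

print(len(fields_11), fields_11)
print(len(fields_12), fields_12)
```

The first record yields five fields, which lines up with the five-field schema in the script. The second yields six, because the colon in the title adds one more split, so the genre ends up in a sixth position that the schema does not cover.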

1 Answer:

Answer 0 (score: 1)

You can use REGEX_EXTRACT_ALL to achieve this.

The following snippet does the job:

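-- Load each line as one chararray field (the default loader splits on tabs,
-- so this assumes the data contains no tab characters)
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
               AS (f1:chararray);
-- Capture the three parts separated by '::'; a single ':' stays inside the title
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;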

I got the following output (each record is a tuple). The ":" after "Dracula" is intact.

(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
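
To check the pattern against other records before running a Pig job, the same regular expression can be tried in Python. This is a sketch under stated assumptions, not part of the Pig solution: it assumes REGEX_EXTRACT_ALL requires the whole line to match (mirrored here with re.fullmatch), and parse_movie is a hypothetical helper name.

```python
import re

# Same pattern as in the Pig script; fullmatch requires the whole line to match.
PATTERN = re.compile(r"(.*)::(.*)::(.*)")

def parse_movie(line):
    """Return (movie_id, title, genre), or None if the line does not match."""
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    movie_id, title, genre = m.groups()
    return int(movie_id), title, genre

print(parse_movie("12::Dracula: Dead and Loving It (1995)::Comedy|Horror"))
```

Because each sample record contains exactly two "::" delimiters, the three groups have only one possible split, and the single colon in the title stays within the second group.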