I am trying to load some data into Pig.
Records:
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
Script used:
loadMoviesDs = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
USING PigStorage(':')
AS (Movieid:long, dummy1, Title:chararray, dummy2, Genere:chararray);
Output:
11,,American President, The (1995),,Comedy|Drama|Romance
12,,Dracula,, Dead and Loving It (1995)
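How do I handle the colon (:) after "Dracula"?
Because of that colon, the title (the second of the three logical columns) is split into two fields, so the last column for Movieid 12 (Comedy|Horror) falls outside the declared schema and is not loaded.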
Answer 0 (score: 1)
You can use REGEX_EXTRACT_ALL to achieve this.
Here is a snippet that does it:
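-- Load each line as one chararray (the default PigStorage delimiter is tab, which these lines do not contain)
A = LOAD '/Users/Prateek/Downloads/ml-10M100K/movies.dat'
AS (f1:chararray);
-- Capture the three fields separated by the two-character delimiter '::'
B = FOREACH A GENERATE REGEX_EXTRACT_ALL(f1, '(.*)::(.*)::(.*)');
-- Unnest the tuple of captured groups into separate fields
C = FOREACH B GENERATE FLATTEN($0);
D = FOREACH C GENERATE $0 AS (MovieID:long), $1 AS (Title:chararray), $2 AS (Genre:chararray);
DUMP D;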
I got the following output (each record is a tuple). The ":" after "Dracula" is left intact.
(11,American President, The (1995),Comedy|Drama|Romance)
(12,Dracula: Dead and Loving It (1995),Comedy|Horror)
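To see why the pattern works where the single-character delimiter does not, here is a minimal sketch in Python rather than Pig. It assumes Python's `re` module behaves like the Java regex engine Pig uses for this simple pattern, and the `parse` helper is a hypothetical name introduced only for illustration. Splitting on ':' gives 5 pieces for the first record but 6 for the second, while the regex yields three fields for both because only the two '::' pairs can satisfy the pattern.

```python
import re

# The two sample records from the question.
records = [
    "11::American President, The (1995)::Comedy|Drama|Romance",
    "12::Dracula: Dead and Loving It (1995)::Comedy|Horror",
]

# Same pattern as in the Pig script above.
PATTERN = re.compile(r"(.*)::(.*)::(.*)")


def parse(line):
    """Hypothetical helper: return (movie_id, title, genre), or None if no match."""
    # fullmatch mirrors REGEX_EXTRACT_ALL, which requires the whole string to match.
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    movie_id, title, genre = m.groups()
    return int(movie_id), title, genre


for line in records:
    # Number of pieces from a naive split on ':' next to the regex result.
    print(len(line.split(":")), parse(line))
```

Note that the groups are greedy, so a title that itself contained "::" would be split differently; the sample records shown here do not have that problem.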