I have two files.
One is titles.csv, which contains movie IDs and titles in the following format:
999: Title
734: Another_title
The other is a list of user IDs linked to movies, in this format:
categoryID:user1_id,....
222: 120
227: 414 551
249: 555
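The lists vary in length (each category has at least one user).
The first goal is to parse each line, in both files, into two parts: everything before the ':' and everything after it.
This is what I tried: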
movies = LOAD .... USING PigStorage('\n') AS (line: chararray)
users = LOAD .... USING PigStorage('\n') AS (line: chararray)
-- parse 'users'/outlinks, make a list and count fields
tokenized = FOREACH users GENERATE FLATTEN(TOKENIZE(line, ':')) AS parameter;
filtered = FILTER tokenized BY INDEXOF(parameter, ' ') != -1;
result = FOREACH filtered GENERATE SUBSTRING(parameter, 2, (int)SIZE(parameter)) AS number;
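But this is where I get stuck and confused. Any ideas?
I also need to output the top 10 entries with the most user IDs in the second part of the string.
Answer 0 (score: 1)
Try this: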
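-- Load each line as a single chararray so the regex functions can operate on it
movies = LOAD 'file1' AS (titleLine:chararray);
-- Capture everything before the ':' and everything after the following whitespace
A = FOREACH movies GENERATE FLATTEN(REGEX_EXTRACT_ALL(titleLine,'^(.*):\\s+(.*)$')) AS (movieId:chararray,title:chararray);
users = LOAD 'file2' AS (userLine:chararray);
B = FOREACH users GENERATE FLATTEN(REGEX_EXTRACT_ALL(userLine,'^(.*):\\s+(.*)$')) AS (categoryId:chararray,userId:chararray);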
Output 1:
(999,Title)
(734,Another_title)
Output 2:
(222,120)
(227,414 551)
(249,555 )
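The question also asks for the top 10 entries with the most user IDs, which the split by itself does not produce. Here is a minimal sketch of one way to get there. It assumes the user IDs after the colon are separated by spaces, as in the sample, and that 'file2' is the same placeholder path used in this answer. The relation names counted, ranked and top10 and the field name userCount are made up for illustration.

```pig
-- Load each line as a single chararray field
users = LOAD 'file2' AS (userLine:chararray);

-- Split into the part before ':' and the part after it
B = FOREACH users GENERATE
        FLATTEN(REGEX_EXTRACT_ALL(userLine, '^(.*):\\s+(.*)$'))
        AS (categoryId:chararray, userIds:chararray);

-- TOKENIZE turns the space-separated IDs into a bag; COUNT gives the bag size
counted = FOREACH B GENERATE categoryId, COUNT(TOKENIZE(userIds, ' ')) AS userCount;

-- Sort by the number of user IDs, largest first, and keep ten rows
ranked = ORDER counted BY userCount DESC;
top10 = LIMIT ranked 10;

DUMP top10;
```

With the three sample lines, category 227 should come out first, since it is the only one with two user IDs. The order among categories with equal counts is not guaranteed.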