I am trying to run the following script, but it fails.
I start with a set of movies and group them by {year, rating}.
movies = LOAD '/movies_data.csv'
USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);
grouped = GROUP movies BY (year, rating);
The resulting schema is:
DESCRIBE grouped;
grouped: {group: (year: int,rating: double),movies: {(id: int,name: chararray,
year: int,rating: double,duration: int)}}
Now, for each group, I want the list of movie names that contain the year (which is part of the group key). So I tried the following:
model = FOREACH grouped {
    listNames = DISTINCT movies.name;
    listNamesFiltered = FILTER listNames BY name MATCHES group::year;
    GENERATE
        group.year AS year,
        group.rating AS rating,
        listNamesFiltered AS listNamesFiltered,
        COUNT(listNamesFiltered) AS countNamesFiltered;
};
But it fails with the following message:
Invalid field projection. Projected field [group::year] does not exist in schema: name:chararray.
Using a constant pattern instead, as in the following line:
listNamesFiltered = FILTER listNames BY name MATCHES '.*2010.*';
works and gives this result:
(2010,2.6,{(2010: Moby Dick)},1)
(2010,3.8,{(Saturday Night Live: The 2010s)},1)
Any help is greatly appreciated.
Answer 0 (score: 0)
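This looks much easier if you do all the filtering first and only then perform the GROUP/DISTINCT/COUNT operations. As the error message indicates, the nested FILTER only sees the schema of listNames (name:chararray), so group::year cannot be resolved there. Before grouping, name and year sit in the same row, so the filter can use both.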
Data:
1 2010: Moby Dick 2010 2.6 128
2 Saturday Night Live: The 2010s 2010 3.8 127
3 2001: A Space Odyssey 2001 4.0 145
4 Forrest Gump 1994 4.9 334
Query:
movies = LOAD 'movie_data.csv' USING PigStorage(',') AS (id:int,
name:chararray, year:int, rating:double, duration:int);
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');
dump filtered;
Output:
(1,2010: Moby Dick,2010,2.6,128)
(2,Saturday Night Live: The 2010s,2010,3.8,127)
(3,2001: A Space Odyssey,2001,4.0,145)
Then do whatever else you need to do (COUNT, etc.).
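Putting the pieces together, the filter-first approach could look roughly like the sketch below. It reuses the FILTER line from the query above and the field names from the question. The relation names `filtered`, `grouped`, `names` and `model` are only illustrative. It assumes the input file is really comma-separated, as `PigStorage(',')` expects, and it has not been run against a particular Pig version.

```pig
-- Load the movies; assumes a comma-separated file with these five columns.
movies = LOAD 'movie_data.csv' USING PigStorage(',')
    AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Filter BEFORE grouping, while name and year are still in the same row.
-- MATCHES must match the whole string, hence the leading and trailing '.*'.
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');

-- Group only the surviving rows by (year, rating).
grouped = GROUP filtered BY (year, rating);

-- Inside the nested block the bag is named after the grouped relation (filtered).
model = FOREACH grouped {
    names = DISTINCT filtered.name;
    GENERATE
        group.year AS year,
        group.rating AS rating,
        names AS listNamesFiltered,
        COUNT(names) AS countNamesFiltered;
};

DUMP model;
```

Because the year test happens before the GROUP, no nested FILTER is needed and nothing inside the FOREACH block has to refer back to the group key. A (year, rating) pair whose movies all fail the filter will simply not appear in the result, which differs from the original attempt, where such a group would have shown up with an empty bag and a count of 0.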