Filter an inner bag by another field (not a constant)

Time: 2016-01-03 11:52:57

Tags: apache-pig

I am trying to run the following scenario, but it fails.

I start with a list of movies and group them by {year, rating}.

movies = LOAD '/movies_data.csv'
USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);

grouped = GROUP movies BY (year, rating);

The resulting schema is:

DESCRIBE grouped;
grouped: {group: (year: int,rating: double),movies: {(id: int,name: chararray,
year: int,rating: double,duration: int)}}

Now, for each group, I want to get the list of movie names that contain the year (which is part of the group key). So I tried the following:

model = 
    FOREACH grouped {
    listNames = DISTINCT movies.name;
    listNamesFiltered = FILTER listNames BY name MATCHES group::year;
    GENERATE 
        group.year AS year
        ,group.rating AS rating
        ,listNamesFiltered AS listNamesFiltered     
        ,COUNT(listNamesFiltered) AS countNamesFiltered
        ;};

But it fails with the following message:

Invalid field projection. Projected field [group::year] does not exist in schema: name:chararray.

Using a constant instead (as in the following line):

listNamesFiltered = FILTER listNames BY name MATCHES '.*2010.*';

works and gives this result:

(2010,2.6,{(2010: Moby Dick)},1)
(2010,3.8,{(Saturday Night Live: The 2010s)},1)

Any help is greatly appreciated.

1 Answer:

Answer 0: (score: 0)

This looks much easier if you do all the filtering first and only then run the GROUP/DISTINCT/COUNT operations.

Data

1       2010: Moby Dick                 2010    2.6     128
2       Saturday Night Live: The 2010s  2010    3.8     127
3       2001: A Space Odyssey           2001    4.0     145
4       Forrest Gump                    1994    4.9     334

Query

movies =  LOAD 'movie_data.csv' USING PigStorage(',') AS (id:int, 
                name:chararray, year:int, rating:double, duration:int);
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');
dump filtered;

Output

(1,2010: Moby Dick,2010,2.6,128)
(2,Saturday Night Live: The 2010s,2010,3.8,127)
(3,2001: A Space Odyssey,2001,4.0,145)

Then do whatever else you need to do (COUNT, etc.).
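As the error message in the question suggests, a FILTER nested inside the FOREACH only sees the schema of the inner relation (here just `name`), so the group key cannot be referenced there. Filtering before grouping avoids the problem. Below is a minimal sketch of how the remaining steps might look, not a tested solution. It assumes the file path and comma delimiter from the question's LOAD statement and reuses the question's alias names.

```pig
-- Sketch only: the path and the ',' delimiter are taken from the question's LOAD.
movies = LOAD '/movies_data.csv' USING PigStorage(',')
         AS (id:int, name:chararray, year:int, rating:double, duration:int);

-- Keep only the movies whose name contains their own year.
-- The pattern is built per row, so it does not have to be a constant.
filtered = FILTER movies BY name MATCHES StringConcat('.*', (chararray)year, '.*');

-- Group the surviving rows by (year, rating), as in the question.
grouped = GROUP filtered BY (year, rating);

-- No nested FILTER is needed here: every name in the bag has already matched.
model = FOREACH grouped {
    listNames = DISTINCT filtered.name;
    GENERATE
        group.year       AS year,
        group.rating     AS rating,
        listNames        AS listNamesFiltered,
        COUNT(listNames) AS countNamesFiltered;
};

DUMP model;
```

One difference from the nested-filter approach is worth keeping in mind: a (year, rating) group in which no name matches is dropped from the result entirely, rather than appearing with an empty bag and a count of 0.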