Pig: count only specific rows

Time: 2016-11-03 15:36:33

Tags: count group-by apache-pig

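My data has the fields location, sentiment, and brand. For each brand and location, I want to count the positive, negative, and neutral sentiments.

Assuming x holds the data, I did: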

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), COUNT(x.sentiment=="positive"?1:0) AS positive_count, COUNT(x.sentiment=="negative"?1:0) AS negative_count, COUNT(x.sentiment=="neutral:?1:0) as neutral_count;

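But I get the syntax error Unexpected character '"'

I also tried grouping by all three fields (location, sentiment, and brand), but that only gives me one total per sentiment, each on its own row: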

{location: "newyork", brand: "pampers", sentiment = "positive", count = 10}
{location: "newyork", brand: "pampers", sentiment = "negative", count = 2}
{location: "newyork", brand: "pampers", sentiment = "neutral", count = 20}

I want the positive, negative, and neutral counts as separate fields on the same row, like this:

{location: "newyork", brand: "pampers", positive_count = 10, negative_count = 2, neutral_count = 20}
{location: "london", brand: "pampers", positive_count = 12, negative_count = 0, neutral_count = 35}
{location: "newyork", brand: "huggies", positive_count = 40, negative_count = 6, neutral_count = 10}

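Can anyone help me?

2 Answers:

Answer 0: (score: 0)

Use single quotes. Pig string literals are single-quoted, which is why the parser rejects the '"' character.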

a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), 
                    COUNT(x.sentiment=='positive'?1:0) AS positive_count, 
                    COUNT(x.sentiment=='negative'?1:0) AS negative_count, 
                    COUNT(x.sentiment=='neutral'?1:0) as neutral_count;

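Edit

Single quotes fix the syntax error, but COUNT counts the tuples in a bag rather than summing a 1/0 flag, so the snippet above does not yield per-sentiment counts. Filtering inside a nested FOREACH does. Sample input: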

newyork pampers positive
newyork pampers positive
newyork pampers negative
newyork pampers positive
newyork pampers positive
newyork pampers neutral
newyork pampers positive
newyork pampers negative
newyork pampers neutral
newyork pampers positive
newyork pampers positive
newyork pampers neutral

Script

B = GROUP A BY (location,brand);
C = FOREACH B  { 
                  A1 = FILTER A BY sentiment matches 'positive';
                  A2 = FILTER A BY sentiment matches 'negative';
                  A3 = FILTER A BY sentiment matches 'neutral';
                  GENERATE FLATTEN(group) as (location,brand),COUNT(A1),COUNT(A2),COUNT(A3);
               };

Output (expected for the sample input above)

(newyork,pampers,7,2,3)
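Another way to get the three columns, closer to the 1/0 idea in the question, is to compute a flag per row and SUM the flags per group instead of counting them. This is a minimal sketch, not part of the original answer. It assumes x was loaded with a schema of three chararray fields (location, brand, sentiment); the aliases flagged, grouped, and counts and the is_* field names are made up for illustration.

```pig
-- Assumption: x has the schema (location:chararray, brand:chararray, sentiment:chararray).
-- Turn each row's sentiment into three 1/0 flags. The bincond must be parenthesized.
flagged = FOREACH x GENERATE location, brand,
              (sentiment == 'positive' ? 1 : 0) AS is_positive,
              (sentiment == 'negative' ? 1 : 0) AS is_negative,
              (sentiment == 'neutral'  ? 1 : 0) AS is_neutral;

grouped = GROUP flagged BY (location, brand);

-- SUM adds the flags up, so each column counts only the matching rows.
counts = FOREACH grouped GENERATE FLATTEN(group) AS (location, brand),
             SUM(flagged.is_positive) AS positive_count,
             SUM(flagged.is_negative) AS negative_count,
             SUM(flagged.is_neutral)  AS neutral_count;
```

Because every row contributes a 0 or a 1 to each column, a (location, brand) pair with no rows for some sentiment gets 0 in that column rather than dropping out.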

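Answer 1: (score: 0)

I filtered the alias holding the raw data once per sentiment, counted the entries in each filtered alias, and then joined the results together.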


It is a bit verbose, but it works.
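A minimal sketch of the filter, count, and join approach described in this answer might look like the following. It is an illustration, not the answerer's actual code. It assumes the alias x from the question, loaded with a schema of three chararray fields (location, brand, sentiment); every other alias name is hypothetical.

```pig
-- Assumption: x has the schema (location:chararray, brand:chararray, sentiment:chararray).
pos = FILTER x BY sentiment == 'positive';
neg = FILTER x BY sentiment == 'negative';
neu = FILTER x BY sentiment == 'neutral';

-- Count each filtered alias per (location, brand).
pos_g = GROUP pos BY (location, brand);
pos_c = FOREACH pos_g GENERATE FLATTEN(group) AS (location, brand), COUNT(pos) AS positive_count;
neg_g = GROUP neg BY (location, brand);
neg_c = FOREACH neg_g GENERATE FLATTEN(group) AS (location, brand), COUNT(neg) AS negative_count;
neu_g = GROUP neu BY (location, brand);
neu_c = FOREACH neu_g GENERATE FLATTEN(group) AS (location, brand), COUNT(neu) AS neutral_count;

-- Pig outer joins take exactly two inputs, so join in two steps.
-- FULL OUTER keeps pairs that lack one sentiment; the missing count is replaced by 0.
pn = JOIN pos_c BY (location, brand) FULL OUTER, neg_c BY (location, brand);
pn_c = FOREACH pn GENERATE
           (pos_c::location IS NOT NULL ? pos_c::location : neg_c::location) AS location,
           (pos_c::brand IS NOT NULL ? pos_c::brand : neg_c::brand) AS brand,
           (pos_c::positive_count IS NOT NULL ? pos_c::positive_count : 0L) AS positive_count,
           (neg_c::negative_count IS NOT NULL ? neg_c::negative_count : 0L) AS negative_count;

pnn = JOIN pn_c BY (location, brand) FULL OUTER, neu_c BY (location, brand);
result = FOREACH pnn GENERATE
             (pn_c::location IS NOT NULL ? pn_c::location : neu_c::location) AS location,
             (pn_c::brand IS NOT NULL ? pn_c::brand : neu_c::brand) AS brand,
             (pn_c::positive_count IS NOT NULL ? pn_c::positive_count : 0L) AS positive_count,
             (pn_c::negative_count IS NOT NULL ? pn_c::negative_count : 0L) AS negative_count,
             (neu_c::neutral_count IS NOT NULL ? neu_c::neutral_count : 0L) AS neutral_count;
```

A plain inner join would be shorter, but it would drop any (location, brand) pair that is missing one of the three sentiments, such as a pair with no negative rows.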