Finding the unique visitors of a web page

Asked: 2014-02-20 19:25:23

Tags: hadoop apache-pig

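I want to write a Pig script that finds the number of unique user IDs that visited a particular web page.

Table definition: a = (userid:chararray, otherid:chararray, webpage:chararray)

This is what I wrote, but it does not work: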

a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));

1 answer:

Answer 0 (score: 1)

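You need to use DISTINCT inside a nested FOREACH; it is a relational operator, not a UDF, so it cannot be called like a function inside GENERATE. Note also that the built-in function names are case-sensitive, so it is COUNT, not count. This should get you where you want to go: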

a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page {
    uniq = DISTINCT a.userid;
    GENERATE group, COUNT(uniq);
};

Go here for more information on nested FOREACH.
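
The script above yields one count per page. If only a single page is of interest, as the question's wording suggests, one possible variation is to filter first and then count the distinct user IDs over the whole remaining relation. This is only a sketch: the path 'input' is taken from the answer, while the page value '/index.html' and the alias names one_page, users, uniq_users, all_users and result are made up for illustration.

```pig
-- Load the data with the schema from the question.
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);

-- Keep only the rows for one page ('/index.html' is a hypothetical value).
one_page = FILTER a BY webpage == '/index.html';

-- Project the user ID and remove duplicates at the relation level.
users = FOREACH one_page GENERATE userid;
uniq_users = DISTINCT users;

-- GROUP ... ALL puts every remaining row into a single bag so COUNT can run over it.
all_users = GROUP uniq_users ALL;
result = FOREACH all_users GENERATE COUNT(uniq_users);

DUMP result;
```

Here DISTINCT is applied to a whole relation rather than inside a nested block, which is fine because only one page is left after the FILTER. Keep in mind that COUNT skips tuples whose first field is null, so rows without a userid are not counted.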