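I asked this question about how to get a rolling count of distinct users with SQL, but I also have Hadoop available, and now I'm wondering whether this kind of analysis is a better fit for Hadoop. Unfortunately, I'm new to Hadoop, so beyond loading data and running the most basic MapReduce jobs, I have no idea how to approach this. Assuming this is a good candidate for Hadoop, what is the best way to do it?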
Answer 0: (score: 0)
One possible way to model this in MapReduce is:
In Pig, everything except keeping the running total can be done with the following script:
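-- Load the tab-separated log; fields are declared chararray so SUBSTRING applies directly
A = LOAD '/home/cswhite/data.tsv' USING PigStorage('\t') AS (SESSION:chararray, USER_ID:chararray, TIMESTAMP:chararray);
-- Keep only the user and the date part (first 10 characters) of the timestamp
B = foreach A GENERATE USER_ID, SUBSTRING(TIMESTAMP, 0, 10) AS DATE;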
-- Inclusive lower bound, so the 2013-01-01 rows shown in the output for G are kept
BF = filter B by DATE >= '2013-01-01';
-- For each user, keep only the earliest date on which they appear
C = group BF by USER_ID;
D = foreach C {
    sorted = order BF by DATE;
    earliest = limit sorted 1;
    generate group, flatten(earliest);
}
E = foreach D generate group as USER_ID, earliest::DATE as DATE;
-- Count how many users were first seen on each date
F = group E by DATE;
G = foreach F generate group as DATE, COUNT(E) as USERS_CNT;
-- Collapse all per-date counts into a single group and sum them
H = group G ALL;
I = foreach H generate SUM(G.USERS_CNT) as TOTAL_USERS;
So, for the following tab-separated input data:
1 99 2013-01-01 2:23:33
2 101 2013-01-01 2:23:55
3 104 2013-01-01 2:24:41
4 101 2013-01-01 2:24:43
5 101 2013-01-02 2:25:01
6 102 2013-01-02 2:26:01
7 99 2013-01-03 2:27:01
8 92 2013-01-04 2:28:01
9 234 2013-01-05 2:29:01
Alias G is:
(2013-01-01,3)
(2013-01-02,1)
(2013-01-04,1)
(2013-01-05,1)
Alias I is:
(6)
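The one piece the script leaves out is the running total. Here is a minimal sketch of that last step in Python, assuming the rows of G have been pulled out of Hadoop as (date, count) pairs and are small enough to process on one machine. The names `running_totals` and `g_rows` are made up for illustration and are not part of Pig.

```python
from itertools import accumulate


def running_totals(daily_counts):
    """Return (date, cumulative_count) pairs in ascending date order.

    daily_counts: iterable of (date_string, new_user_count) pairs,
    where dates are ISO formatted (YYYY-MM-DD) so they sort as strings.
    """
    ordered = sorted(daily_counts)
    dates = [date for date, _ in ordered]
    totals = accumulate(count for _, count in ordered)
    return list(zip(dates, totals))


# Rows copied from the output of alias G above
g_rows = [
    ("2013-01-01", 3),
    ("2013-01-02", 1),
    ("2013-01-04", 1),
    ("2013-01-05", 1),
]

if __name__ == "__main__":
    for date, total in running_totals(g_rows):
        print(date, total)
```

Since G has at most one row per day, this step is cheap compared with the grouping work done in Pig, so running it outside the cluster is usually reasonable.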