Path analysis of how users navigate my website, including the time spent at each step, in Apache Pig

Time: 2015-08-14 16:35:30

Tags: hadoop hive apache-pig

Input dataset:

(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com/home) 
(2012-07-21T14:01:00.000Z, mary, hxxp:///www.aaa.com/watch)  
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com/movie)
(2012-07-21T14:01:00.000Z, mary, hxxp:///www.aaa.com/mobile) 

Expected output:

(joe (hxxp:///www.aaa.com/home, hxxp:///www.aaa.com/movie))
(mary(hxxp:///www.aaa.com/watch, hxxp:///www.aaa.com/mobile))

I want to do this kind of path analysis in Apache Pig.

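I want to analyze how users move through my website. For example, a user first views hxxp:///www.aaa.com/home and 2 seconds later moves on to hxxp:///www.aaa.com/movie. Besides the path itself, I also want to know how much time users spend as they navigate the site.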

1 answer:

Answer 0: (score: 1)

Input:

2012-07-21T14:00:00.000Z,joe,hxxp:///www.aaa.com/home
2012-07-21T14:01:00.000Z,mary,hxxp:///www.aaa.com/watch
2012-07-21T14:02:00.000Z,joe,hxxp:///www.aaa.com/movie
2012-07-21T14:01:00.000Z,mary,hxxp:///www.aaa.com/mobile

Pig script:

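-- Load the comma-separated log: timestamp, user name, visited URL.
user_navigation_data = LOAD 'user_nav_data.csv' USING PigStorage(',') AS (time:datetime, user:chararray, url:chararray);
-- Collect all page views of each user into one bag.
nav_data_grp_user = GROUP user_navigation_data BY user;
user_nav_stats = FOREACH nav_data_grp_user {
      -- Sort this user's page views chronologically.
      user_navigation_data_ord = ORDER user_navigation_data BY time;
      -- Join the ordered URLs into a single path string.
      GENERATE group AS user, BagToString(user_navigation_data_ord.url, '-->') AS urls_accessed;
};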

Output of DUMP user_nav_stats;

(joe,hxxp:///www.aaa.com/home-->hxxp:///www.aaa.com/movie)
(mary,hxxp:///www.aaa.com/watch-->hxxp:///www.aaa.com/mobile)
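
The script above produces each user's path but not the time spent, which the question also asks about. The following is only a minimal sketch of one way to get it, assuming the same input file and schema as above. The alias names nav, nav_ts, nav_grp and user_time and the field name seconds_on_site are made up for illustration. It converts each timestamp to Unix seconds with the built-in ToUnixTime and subtracts each user's earliest value from the latest.

Pig script (sketch):

-- Same input file and schema as the script above.
nav = LOAD 'user_nav_data.csv' USING PigStorage(',') AS (time:datetime, user:chararray, url:chararray);
-- Convert each timestamp to seconds since the epoch so it can be subtracted.
nav_ts = FOREACH nav GENERATE user, url, ToUnixTime(time) AS ts;
-- Collect all page views of each user into one bag.
nav_grp = GROUP nav_ts BY user;
-- Seconds between the first and the last page view of each user.
user_time = FOREACH nav_grp GENERATE group AS user, MAX(nav_ts.ts) - MIN(nav_ts.ts) AS seconds_on_site;
DUMP user_time;

For the sample input this should report 120 seconds for joe (14:00 to 14:02) and 0 seconds for mary, because both of mary's rows carry the timestamp 14:01. For the same reason, ordering by time alone does not fix the order of mary's two URLs in the path, so they may come out in either order.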