Most frequent 3-page sequence in a web log

Date: 2010-06-07 16:54:47

Tags: algorithm

Given a web log consisting of 'user', 'page URL', and timestamp fields, we have to find the 3-page sequence that users follow most frequently.

There is a timestamp, and it is not guaranteed that a single user's accesses are logged consecutively; the log could look like: user1 Page1, user2 PageX, user1 Page2, user3 PageY, user1 Page3. Here user1's page sequence is Page1 -> Page2 -> Page3.

6 Answers:

Answer 0 (score: 4)

Assuming your log is stored in timestamp order, here is an algorithm that does what you need:

  1. Create a hash table 'user_visits' mapping user IDs to the last two pages you observed them visiting.
  2. Create a hash table 'visit_count' mapping 3-tuples of pages to frequency counts.
  3. For each entry (user, URL) in the log:
    1. If 'user' exists in user_visits with two entries, increment the entry in visit_count corresponding to the 3-tuple of URLs.
    2. Append 'URL' to the relevant entry in user_visits, dropping the oldest entry if necessary.
  4. Sort the visit_count hash table by value. This is your list of the most popular URL sequences.

Here is an implementation in Python, assuming your fields are space-separated:

    fh = open('log.txt', 'r')
    user_visits = {}
    visit_counts = {}
    for row in fh:
      user, url = row.split()
      prev_visits = user_visits.get(user, ())
      if len(prev_visits) == 2:
        visit_tuple = prev_visits + (url,)
        visit_counts[visit_tuple] = visit_counts.get(visit_tuple, 0) + 1
      # keep only the last two URLs seen for this user
      user_visits[user] = (prev_visits + (url,))[-2:]
    popular_sequences = sorted(visit_counts.items(), key=lambda item: item[1], reverse=True)
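
For example, once the log has been processed, the top sequences and their counts can be printed like this (a small usage sketch based on the listing above):

    for pages, count in popular_sequences[:3]:
        print(' -> '.join(pages), count)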
    

Answer 1 (score: 3)

Quick and dirty:

  • Build a list of URL/timestamp pairs for each user
  • Sort each list by timestamp
  • Walk through each list
    • For every sequence of 3 consecutive URLs, create or increment a counter
  • Find the highest count in the list of URL-sequence counts

foreach(entry in parsedLog)
{
    users[entry.user].urls.add(entry.time, entry.url)
}

foreach(user in users)
{
    user.urls.sort()
    for(i = 0; i < user.urls.length - 2; i++)
    {
        key = createKey(user.urls[i], user.urls[i+1], user.urls[i+2])
        sequenceCounts.incrementOrCreate(key);
    }
}

sequenceCounts.sortDesc()
largestCountKey = sequenceCounts[0]
topUrlSequence = parseKey(largestCountKey)
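
A runnable Python sketch of the same quick-and-dirty approach (the whitespace-separated "timestamp user url" line format is an assumption made for illustration; adapt the parsing to your actual log layout):

    from collections import Counter, defaultdict

    def top_3page_sequences(path, n=1):
        per_user = defaultdict(list)          # user -> [(timestamp, url), ...]
        with open(path) as fh:
            for line in fh:
                # assumed field order: timestamp user url (whitespace-separated)
                ts, user, url = line.split()
                per_user[user].append((ts, url))

        counts = Counter()
        for visits in per_user.values():
            # sort each user's visits by timestamp (assumes sortable strings, e.g. ISO-8601)
            visits.sort()
            urls = [u for _, u in visits]
            for i in range(len(urls) - 2):    # every window of 3 consecutive URLs
                counts[tuple(urls[i:i + 3])] += 1

        return counts.most_common(n)          # highest-count 3-URL sequences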

Answer 2 (score: 2)

Here is some SQL, assuming you can get your log into a table such as

CREATE TABLE log (
   ord  int,
   user VARCHAR(50) NOT NULL,
   url  VARCHAR(255) NOT NULL,
   ts   datetime
) ENGINE=InnoDB;

If the data is not sorted by user (assuming the ord column is the line number in the log file):

SELECT t.url, t2.url, t3.url, count(*) c
FROM  
      log t INNER JOIN
      log t2 ON t.user = t2.user INNER JOIN
      log t3 ON t2.user = t3.user
WHERE 
      t2.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t.ord) 
      AND
      t3.ord IN (SELECT MIN(ord) 
                 FROM log i 
                 WHERE i.user = t.user AND i.ord > t2.ord)
GROUP BY t.user, t.url, t2.url, t3.url
ORDER BY c DESC
LIMIT 10;

This gives you the top ten 3-stop paths per user. Alternatively, if you can order by user and time, you can join on a row number much more easily.

Answer 3 (score: 1)

Source code in Mathematica:

s= { {user},{page} }  (* load List (log) here *)

sortedListbyUser=s[[Ordering[Transpose[{s[[All, 1]], Range[Length[s]]}]] ]]

Tally[Partition [sortedListbyUser,3,1]]

Answer 4 (score: 1)

This problem is similar to: finding the k most frequent words in a file.

Here is how to approach the problem:

  • Group each 3-page triple (page1, page2, page3) into a single "word"
  • Apply the algorithm for finding the k most frequent words mentioned here (a short sketch follows below)
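
For illustration, a minimal Python sketch of this reduction (the per_user_pages input shape, the space-joined "word" encoding, and the heap-based top-k selection are assumptions, not part of the original answer):

    import heapq
    from collections import Counter

    def top_k_page_triples(per_user_pages, k=1):
        # per_user_pages: {user: [page1, page2, ...]}, each list already in visit order
        word_counts = Counter()
        for pages in per_user_pages.values():
            for i in range(len(pages) - 2):
                word = ' '.join(pages[i:i + 3])   # encode each 3-page window as one "word"
                word_counts[word] += 1
        # return the k most frequent "words", as in the k-most-frequent-words problem
        return heapq.nlargest(k, word_counts.items(), key=lambda kv: kv[1])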

Answer 5 (score: 1)

1. Read user page-access URLs from the file line by line; user and URL are separated by a separator (a comma here), e.g.:

    u1,/
    u1,main
    u1,detail

2. Store each 3-page path's visit count in the map pageVisitCounts.
3. Sort the visit-count map by value in descending order.

// Imports needed by the methods below (wrap them in a class to compile):
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public static Map<String, Integer> findThreeMaxPagesPathV1(String file, String separator, int depth) {
    Map<String, Integer> pageVisitCounts = new HashMap<String, Integer>();
    if (file == null || "".equals(file)) {
        return pageVisitCounts;
    }
    try {
        File f = new File(file);
        FileReader fr = new FileReader(f);
        BufferedReader bf = new BufferedReader(fr);

        Map<String, List<String>> userUrls = new HashMap<String, List<String>>();
        String currentLine = "";
        while ((currentLine = bf.readLine()) != null) {
            String[] lineArr = currentLine.split(separator);
            if (lineArr == null || lineArr.length != (depth - 1)) {
                continue;
            }
            String user = lineArr[0];
            String page = lineArr[1];
            List<String> urlLinkedList = null;
            if (userUrls.get(user) == null) {
                urlLinkedList = new LinkedList<String>();
            } else {
                urlLinkedList = userUrls.get(user);
                String pages = "";
                if (urlLinkedList.size() == (depth - 1)) {
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                } else if (urlLinkedList.size() > (depth - 1)) {
                    urlLinkedList.remove(0);
                    pages = urlLinkedList.get(0).trim() + separator + urlLinkedList.get(1).trim() + separator + page;
                }
                if (!"".equals(pages) && null != pages) {
                    Integer count = (pageVisitCounts.get(pages) == null ? 0 : pageVisitCounts.get(pages))  + 1;
                    pageVisitCounts.put(pages, count);
                }
            }
            urlLinkedList.add(page);
            System.out.println("user:" + user + ", urlLinkedList:" + urlLinkedList);
            userUrls.put(user, urlLinkedList);
        }
        bf.close();
        fr.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return pageVisitCounts;
}

public static void main(String[] args) {
    String file = "/home/ieee754/Desktop/test-access.log";
    String separator = ",";
    Map<String, Integer> pageVisitCounts = findThreeMaxPagesPathV1(file, separator, 3);
    System.out.println(pageVisitCounts.size());
    Map<String, Integer>  result = MapUtil.sortByValueDescendOrder(pageVisitCounts);
    System.out.println(result);
}