Question

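I'm building a real-time statistics system that keeps an in-memory list of the most frequently accessed URL paths (path only, with query parameters stripped).

I considered a max-heap, but since the URI patterns vary (new patterns cannot be predicted), I can't use that data structure.

What I came up with is that you need to keep a count for every distinct URI, for example: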

www.google.com/pathA   5 times
www.google.com/pathB   3 times
...

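So whenever a new URI pattern is seen, you need to initialize an entry for it; otherwise you might miss an important URI.

You can't really just "keep the top 100".

Implementing this without using a lot of memory then seems impossible.

Any suggestions?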

Answer 1

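Although it doesn't exactly match your requirements, I think a splay tree is what you need. It's a great data structure with the property of keeping recently accessed and frequently accessed elements closer to the root.

If that doesn't suit you, use a heap and update an element's priority when needed. You can't do that with a built-in heap, but it isn't hard to implement.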
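
A minimal sketch of the "update the priority when needed" idea, in Java. Instead of a hand-written heap it assumes a HashMap of counts plus a TreeSet ordered by count, which gives the same "reorder on update" behaviour; the class name HitCounter and its methods are made up for illustration. Note that it still stores one entry per distinct path, so it addresses the ordering problem, not the memory concern from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: exact hit counts plus an ordered view for top-k queries.
class HitCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Higher count first; ties broken by path so distinct paths never compare as equal.
    private final TreeSet<String> byCount = new TreeSet<>((a, b) -> {
        int c = Integer.compare(counts.get(b), counts.get(a));
        return c != 0 ? c : a.compareTo(b);
    });

    void record(String path) {
        // Remove before changing the count: the position in the TreeSet depends on it.
        if (counts.containsKey(path)) {
            byCount.remove(path);
        }
        counts.merge(path, 1, Integer::sum);
        byCount.add(path);
    }

    int count(String path) {
        return counts.getOrDefault(path, 0);
    }

    // Returns up to k paths, most frequently recorded first.
    List<String> top(int k) {
        List<String> result = new ArrayList<>();
        for (String path : byCount) {
            if (result.size() == k) {
                break;
            }
            result.add(path);
        }
        return result;
    }
}
```

Each record call costs O(log n) for the remove and re-insert, and top(k) only walks the first k entries.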

Answer 2

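If you want to be certain that your list really is the top 100, then you are right.

You can write a heuristic for this. For example, keep the top 100 and a "last 100". The second list holds URLs that could potentially become one of the top 100, and counts are kept for them as well. If you see a URL that is in neither the top 100 nor the last 100, you remove an entry from the last 100, for example the one that was visited longest ago.

It won't work if someone visits 101 URLs one after another, but it's a good start. You can think about different heuristics for which entry should be removed, and so on.

Example implementation: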

top100 : list<(URL, count)>
last100: list<(URL, count, score)>

process(URL) {
    if (URL in top100) {
        incrementCount top100[URL];
    } else if (URL in last100) {
        incrementScore last100[URL];
        newCount := incrementCount last100[URL];
        if (newCount > top100.lowestCount) {
            // promote this URL and demote the lowest-count entry of top100
            swap this URL between last100 and top100;
        }
    } else {
        // check whether something in last100 should change, e.g.:
        if (exists entry with score = 0 in last100) {
            remove that entry from last100;
            put (URL, 1, 0) into last100;
        } else {
            decrement all scores in last100;
        }
    }
}

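A quick run with top/last 3 instead of 100. Let's start in the middle, at a point in time where:

top3 = [(A, 10), (B, 4), (C, 3)]
last3 = [(E, 2, 0), (F, 1, 0), (G, 1, 0)]
(A..H are URLs)

G: last3 = [(E, 2, 0), (G, 2, 1), (F, 1, 0)] // inc G score and count

G: last3 = [(E, 2, 0), (G, 3, 2), (F, 1, 0)] // inc G score and count

H: last3 = [(E, 2, 0), (G, 3, 2), (H, 1, 0)] // put H in place of F

F: last3 = [(E, 2, 0), (G, 3, 2), (F, 1, 0)] // put F in place of H

G: top3 = [(A, 10), (B, 4), (G, 4)], last3 = [(E, 2, 0), (C, 3, 2), (F, 1, 0)] // swap G and C

G: top3 = [(A, 10), (B, 4), (G, 5)] // inc G count

F: last3 = [(E, 2, 0), (C, 3, 2), (F, 2, 1)] // inc F score and count

E: last3 = [(E, 3, 1), (C, 3, 2), (F, 2, 1)] // inc E score and count

H: last3 = [(E, 3, 0), (C, 3, 1), (F, 2, 0)] // no element with score = 0, dec all scores

H: last3 = [(E, 3, 0), (C, 3, 1), (H, 1, 0)] // put H in place of F

So F and H occur often, but unfortunately they keep pushing each other out of last3 and so never get into top3. In a real-world scenario with last/top 100 (or more), such a situation is unlikely.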
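
For anyone who wants to experiment with the heuristic, here is a rough Java sketch of the process routine. It is only one possible reading of the pseudocode: the class name TwoTierCounter, filling the lists while they still have room, picking the first score-0 entry found as the victim, and letting the demoted entry inherit the promoted entry's score are all assumptions of this sketch, since the pseudocode leaves them open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the top/last heuristic; all names are illustrative.
class TwoTierCounter {
    private final int capacity;
    private final Map<String, Integer> top = new HashMap<>(); // URL -> count
    private final Map<String, int[]> last = new HashMap<>();  // URL -> {count, score}

    TwoTierCounter(int capacity) {
        this.capacity = capacity;
    }

    void process(String url) {
        if (top.containsKey(url)) {
            top.merge(url, 1, Integer::sum);
        } else if (last.containsKey(url)) {
            int[] entry = last.get(url);
            entry[0]++; // count
            entry[1]++; // score
            promoteIfLarger(url, entry);
        } else if (top.size() < capacity) {
            top.put(url, 1); // assumption: fill the lists while they have room
        } else if (last.size() < capacity) {
            last.put(url, new int[] {1, 0});
        } else {
            // Look for any entry with score 0 (which one is found depends on map order).
            String victim = null;
            for (Map.Entry<String, int[]> e : last.entrySet()) {
                if (e.getValue()[1] == 0) {
                    victim = e.getKey();
                    break;
                }
            }
            if (victim != null) {
                last.remove(victim);
                last.put(url, new int[] {1, 0});
            } else {
                // Nobody is evictable yet: age every candidate instead of admitting the URL.
                for (int[] entry : last.values()) {
                    entry[1]--;
                }
            }
        }
    }

    // Swaps url into top if its count now exceeds the lowest count in top.
    private void promoteIfLarger(String url, int[] entry) {
        String lowest = null;
        for (Map.Entry<String, Integer> e : top.entrySet()) {
            if (lowest == null || e.getValue() < top.get(lowest)) {
                lowest = e.getKey();
            }
        }
        if (lowest != null && entry[0] > top.get(lowest)) {
            int demotedCount = top.remove(lowest);
            last.remove(url);
            top.put(url, entry[0]);
            // Assumption: the demoted entry inherits the promoted entry's score.
            last.put(lowest, new int[] {demotedCount, entry[1]});
        }
    }

    Map<String, Integer> top() {
        return top;
    }

    boolean inLast(String url) {
        return last.containsKey(url);
    }
}
```

The linear scans are fine for lists of about 100 entries; with much larger lists you would want an ordered structure for the lowest count and for the score-0 candidates.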

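More sophisticated strategies should manipulate the scores and counts to better decide whether a new URL should be admitted and, if so, which URL should be evicted. You should prepare some sample data and tune the strategy against it.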

Answer 3

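Update: Sorry, my solution is for the most recently used, not the most popular. I didn't read the question properly before answering.

I think what you're looking for is an LRU (least recently used) cache. We extend LinkedHashMap with the ordering mode set to 'true' (access order) so that entries stay ordered by access, and override 'removeEldestEntry' to return true when the size exceeds maxEntries. In your case, maxEntries = 100.

For details on LinkedHashMap, see (http://docs.oracle.com/javase/6/docs/api/java/util/LinkedHashMap.html)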

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Declare this as a private nested class if it lives inside another class.
class LruCache<A, B> extends LinkedHashMap<A, B> {
    private final int maxEntries;

    public LruCache(final int maxEntries) {
        /* Uses LinkedHashMap(int initialCapacity, float loadFactor, boolean accessOrder),
           which constructs an empty LinkedHashMap with the specified initial
           capacity, load factor and ordering mode (true = access order). */
        super(maxEntries + 1, 1.0f, true);
        this.maxEntries = maxEntries;
    }

    /* Returns true if this LruCache has more entries than the
       maximum specified when it was created. */
    @Override
    protected boolean removeEldestEntry(final Map.Entry<A, B> eldest) {
        return size() > maxEntries;
    }
}

// Usage, inside the calling code (CACHE_SIZE is defined by the caller, e.g. 100).
// The synchronized wrapper makes the map safe to share between threads.
Map<String, String> example = Collections.synchronizedMap(new LruCache<String, String>(CACHE_SIZE));
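
To make the caveat from the update concrete, here is a small self-contained sketch of how such a cache evicts. The class names SmallLruCache and LruDemo and the 2-entry size are made up for the demo. It shows that eviction follows recency of access only, so a path with many hits can still be dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Same idea as the LruCache in this answer, repeated so the sketch is self-contained.
class SmallLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    SmallLruCache(int maxEntries) {
        super(maxEntries + 1, 1.0f, true); // true = access order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}

class LruDemo {
    // Returns the keys left after a short access sequence on a 2-entry cache.
    static String run() {
        Map<String, Integer> cache = new SmallLruCache<>(2);
        cache.put("/pathA", 1);
        cache.put("/pathB", 50);  // pretend /pathB has many hits
        cache.get("/pathA");      // touching /pathA makes /pathB the eldest entry
        cache.put("/pathC", 1);   // evicts /pathB regardless of its stored value
        return cache.keySet().toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [/pathA, /pathC]
    }
}
```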

Keeping a top-100 list in memory
