Question

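This is an Amazon interview question:

Given a log file containing (User_Id, URL, Timestamp), where users navigate from one page to another, find the three-page subset sequence that is repeated the most times. Records are sorted by timestamp.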

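I found this question in this reddit thread.

The poster wrote:

"Given a log file containing (User_Id, URL, Timestamp), a user can navigate from one page to another. Find the three-page subset sequence repeated the maximum number of times. Records are sorted by timestamp."

(Although I wasn't told that the file was sorted by timestamp until the very end of the interview. The first thing I asked was whether the log was sorted, and my interviewer said no.)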

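I think I gave it my all; I seemed to be on the right track using a hashmap. I always let my interviewer know what I was thinking, and gave the possible outcomes, the time complexity, and so on.

I am not sure how to approach this problem. What does "find the three-page subset sequence repeated the maximum number of times" mean? And if the question did not say "records are sorted by timestamp" (as happened to the poster), how would that affect the problem?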

Answer 1

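By "three-page subset sequence" I am guessing they mean that the three pages must be next to each other, but that their internal order does not matter. (A B C = C A B)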

public Tuple<string,string,string> GetMostFrequentTriplet(
        IEnumerable<LogEntry> entries,
        TimeSpan? maxSeparation = null)
{
    // Assuming 'entries' is already ordered by timestamp

    // Store the last two URLs for each user
    var lastTwoUrls = new Dictionary<int,Tuple<string,string,DateTime>>();
    // Count the number of occurrences of each triplet of URLs
    var counters = new Dictionary<Tuple<string,string,string>,int>();

    foreach (var entry in entries)
    {
        Tuple<string,string,DateTime> lastTwo;
        if (!lastTwoUrls.TryGetValue(entry.UserId, out lastTwo))
        {
            // No previous URLs
            lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
        }
        // (comparison with null => false)
        else if (entry.Timestamp - lastTwo.Item3 > maxSeparation) {
            // Treat a longer separation than maxSeparation as two different sessions.
            lastTwoUrls[entry.UserId] = Tuple.Create((string) null, entry.Url, entry.Timestamp);
        }
        else
        {
            // One or two previous URLs
            if (lastTwo.Item1 != null)
            {
                // Two previous URLs; Three with this one.

                // Sort the three URLs, so that their internal order won't matter
                var urls = new List<string> { lastTwo.Item1, lastTwo.Item2, entry.Url };
                urls.Sort();
                var key = Tuple.Create(urls[0], urls[1], urls[2]);

                // Increment count
                int count;
                counters.TryGetValue(key, out count); // sets to 0 if not found
                counters[key] = count + 1;
            }

            // Shift in the new value, discarding the oldest one.
            lastTwoUrls[entry.UserId] = Tuple.Create(lastTwo.Item2, entry.Url, entry.Timestamp);
        }
    }

    Tuple<string,string,string> maxKey = null;
    int maxCount = 0;

    // Find the key with the maximum count
    foreach (var pair in counters)
    {
        if (maxKey == null || pair.Value > maxCount)
        {
            maxKey = pair.Key;
            maxCount = pair.Value;
        }
    }

    return maxKey;
}

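The code goes through the log entries and separates the stream for each user. For every three consecutive URLs of any user, we increment the count for that triplet. Since the order of the three pages is not important, we reorder them in a consistent way by sorting. In the end, we return the triplet with the highest count.

Since we only need the last three URLs for each user, we only store the previous two. Combined with the current URL, that makes the triplet we need.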
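
For comparison, here is a minimal batch sketch of the same counting idea. It is not the streaming method above: it groups entries by user and sorts each group by timestamp, so it does not rely on the log already being ordered, and it leaves out session splitting. The names LogEntry, TripletSketch, CountTriplets and MostFrequent are assumptions made for illustration, and the value-tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical log record; the answer only assumes these three members exist.
public class LogEntry
{
    public int UserId { get; }
    public string Url { get; }
    public DateTime Timestamp { get; }

    public LogEntry(int userId, string url, DateTime timestamp)
    {
        UserId = userId;
        Url = url;
        Timestamp = timestamp;
    }
}

public static class TripletSketch
{
    // Counts order-insensitive triplets of consecutive URLs, per user.
    public static Dictionary<(string, string, string), int> CountTriplets(
        IEnumerable<LogEntry> entries)
    {
        var counts = new Dictionary<(string, string, string), int>();

        foreach (var visits in entries.GroupBy(e => e.UserId))
        {
            // OrderBy is a stable sort, so entries with equal timestamps keep their log order.
            var urls = visits.OrderBy(e => e.Timestamp).Select(e => e.Url).ToList();

            for (int i = 0; i + 2 < urls.Count; i++)
            {
                // Sort the three URLs so that A B C and C A B produce the same key.
                var triple = new[] { urls[i], urls[i + 1], urls[i + 2] };
                Array.Sort(triple, StringComparer.Ordinal);
                var key = (triple[0], triple[1], triple[2]);

                counts.TryGetValue(key, out int n); // n is 0 if the key is missing
                counts[key] = n + 1;
            }
        }

        return counts;
    }

    // Returns the triplet with the highest count, or null if no user has three visits.
    public static (string, string, string)? MostFrequent(IEnumerable<LogEntry> entries)
    {
        var counts = CountTriplets(entries);
        if (counts.Count == 0) return null;
        return counts.OrderByDescending(p => p.Value).First().Key;
    }
}
```

The trade-off is memory: this version holds every user's full history at once, whereas the streaming method keeps only two URLs per user.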

For n URLs, m unique URLs, u users, and s single-visit users, the method performs 2n - 2u + s (= O(n)) dictionary lookups and stores at most C(m, 3) + u (= O(m³ + u)) tuples.

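Edit: the code now infers sessions from the time between requests. If two consecutive requests from the same user are more than maxSeparation apart, the new request is treated as that user's first request.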
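
The session rule relies on a C# detail that is easy to miss: comparing a TimeSpan with a null TimeSpan? evaluates to false, so leaving maxSeparation unset disables session splitting. The helper below isolates that check as a small sketch; SessionRule and StartsNewSession are hypothetical names and do not appear in the method above.

```csharp
using System;

public static class SessionRule
{
    // True when the gap between two requests from the same user exceeds maxSeparation.
    public static bool StartsNewSession(
        DateTime previous, DateTime current, TimeSpan? maxSeparation)
    {
        // Lifted comparison: if maxSeparation is null, '>' evaluates to false,
        // so no gap is ever treated as the start of a new session.
        return current - previous > maxSeparation;
    }
}
```

With a limit of 30 minutes, a 45-minute gap starts a new session and a 10-minute gap does not; with maxSeparation left as null, neither does.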
