Question

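I am working on an algorithm that groups search-engine keywords by the number of identical URLs in their SERPs. Each group represents a URL, and each value is the id of a keyword whose SERP contains that URL.

I have a list of groups: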

groups = [
    [1],
    [1, 2, 3],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]

I need to get all sets of items that appear in at least N groups, in decreasing order of "size":

In the example above, for N = 3, we have two subsets: [1,2,3] and [4,5]
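
To make the "appears in at least N groups" condition concrete, here is a minimal sketch that counts how many groups contain a candidate set. The helper name `support` is an assumption made for illustration and is not part of the original question.

```python
groups = [
    [1],
    [1, 2, 3],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7],
]

def support(candidate, groups):
    """Return the number of groups that contain every element of candidate."""
    cand = set(candidate)
    return sum(1 for g in groups if cand.issubset(g))

print(support([1, 2, 3], groups))     # 3
print(support([4, 5], groups))        # 3
print(support([1, 2, 3, 4], groups))  # 2, so it does not qualify for N = 3
```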

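Here is how I see getting them:

Iteration 1: find the largest set that appears at least 3 times (it is [1,2,3]) and remove it from every set in which it appears.

After the iteration we have: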

groups = [
    [1],
    [4, 5],
    [4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]

Iteration 2: find the largest set that appears at least 3 times (it is [4,5])

After the iteration we have:

groups = [
    [1],
    [4],
    [2, 3],
    [6],
    [7]
]

End of algorithm: no remaining set appears in at least 3 groups.
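
A single removal step from the walkthrough could be sketched as follows. The function name `remove_itemset` is hypothetical, and dropping groups that become empty is an assumption taken from the intermediate states shown in the walkthrough.

```python
def remove_itemset(itemset, groups):
    """Return new groups with itemset removed from every group that contains all of it.

    Groups left empty are dropped (an assumption based on the walkthrough).
    """
    items = set(itemset)
    result = []
    for g in groups:
        if items.issubset(g):
            # Strip the found set from this group, keeping the other values in order.
            g = [x for x in g if x not in items]
        if g:
            result.append(g)
    return result

groups = [[1], [1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 4], [2, 3], [4, 5, 6], [4, 5, 7]]

after_first = remove_itemset([1, 2, 3], groups)
print(after_first)   # [[1], [4, 5], [4], [2, 3], [4, 5, 6], [4, 5, 7]]

after_second = remove_itemset([4, 5], after_first)
print(after_second)  # [[1], [4], [2, 3], [6], [7]]
```

The function builds new lists instead of editing the input, so the original groups can still be inspected after each step.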

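Do you have any ideas for an algorithm to get them?

N is between 1 and 10.

P.S. The list of groups is fairly large, from 1,000 to 10,000 items. The numbers are ids of objects in the db.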

Answer 1

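Here is a first prototype approach/hack combining the beauty of recursion, pseudo-functional programming, and *** from my side. There is a lot of room for improvement, especially regarding iterators/lists. Maybe this even qualifies as spaghetti code :-).

Warning: see @John Coleman's comment regarding binomial coefficients. We generate all possible subsets of the remaining values in each iteration. This could be improved by consuming the generators lazily (but it would still be infeasible for a large number of unique values).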

import itertools

groups = [
    [1],
    [1, 2, 3],
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4],
    [2, 3],
    [4, 5, 6],
    [4, 5, 7]
]

def solve(groups, N, sol=None):
    # A mutable default (sol=[]) would leak results between separate calls.
    if sol is None:
        sol = []
    if len(groups) == 0:
        return sol

    # Distinct values still present in any group.
    rem_vals = list(set(itertools.chain(*groups)))
    # Every 0/1 mask over rem_vals, i.e. all 2**len(rem_vals) subsets.
    combs = itertools.product(range(2), repeat=len(rem_vals))
    combs_ = [[v for v, keep in zip(rem_vals, mask) if keep] for mask in combs]

    # Try the largest candidates first.
    for cand in reversed(sorted(combs_, key=len)):
        if len(cand) == 0:
            continue
        # Indices of the groups that contain the whole candidate.
        inds = [ind for ind, g in enumerate(groups) if set(cand).issubset(g)]

        if len(inds) >= N:
            sol.append(cand)
            # Note: this removes the candidate from the caller's lists in place.
            for i in inds:
                for j in cand:
                    groups[i].remove(j)
            return solve(groups, N, sol)

    return sol

print(solve(groups, 3))

Output

[[1, 2, 3], [4, 5]]
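
One possible way to act on the warning about generating every subset, sketched under assumptions: a set contained in at least N groups can only consist of values that each occur in at least N groups, so rarer values can be discarded up front, and `itertools.combinations` can yield candidates lazily from largest to smallest. The function name `candidates_lazy` is hypothetical, and the search remains exponential in the number of surviving values.

```python
import itertools
from collections import Counter

def candidates_lazy(groups, N):
    """Yield candidate subsets lazily, largest first, using only frequent values."""
    # Count in how many groups each value occurs (set() guards against duplicates).
    counts = Counter(v for g in groups for v in set(g))
    frequent = sorted(v for v, c in counts.items() if c >= N)
    # No candidate can be longer than the longest group.
    max_size = min(len(frequent), max((len(g) for g in groups), default=0))
    for size in range(max_size, 0, -1):
        yield from itertools.combinations(frequent, size)

groups = [[1], [1, 2, 3], [1, 2, 3, 4, 5], [1, 2, 3, 4], [2, 3], [4, 5, 6], [4, 5, 7]]

# Stop at the first candidate contained in at least 3 groups.
first = next(
    c for c in candidates_lazy(groups, 3)
    if sum(set(c).issubset(g) for g in groups) >= 3
)
print(first)  # (1, 2, 3)
```

Here the values 6 and 7 are filtered out before any subset is built, so the generator works over 5 values instead of 7.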
