Question

I have an ndarray dataset, data, containing the following: ['1 2 3' '2 3 4' '3 4 5' '4 5 6' '1 3 5' '2 4 6' '1 3 4' '2 4 5' '3 5 6' '1 2 4' '2 3 5' '3 4 6']. It comes from a space-delimited text file that is 12 lines long.

I have also generated a separate set of tuples called candidates, with the following contents: {('2', '4'), ('2', '5'), ('2', '6'), ('3', '4'), ('4', '5'), ('1', '5'), ('1', '3'), ('3', '6'), ('4', '6'), ('2', '3'), ('5', '6'), ('1', '2'), ('1', '4'), ('3', '5'), ('1', '6')}.

from itertools import combinations

counts = {}  # Count occurrences of all pairs
candidates = set(combinations(frequents, 2))  # Generate combos of frequent items

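I want to check whether both elements of each tuple in candidates are present in the current row of data. If so, I add that candidate to the dictionary counts and increment its number of occurrences. The output is correct, as expected, and is shown below.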

{"('1', '3')": 3, "('2', '3')": 3, "('1', '2')": 2, "('2', '4')": 4, "('3', '4')": 4, "('4', '5')": 3, "('3', '5')": 4, "('4', '6')": 3, "('5', '6')": 2, "('1', '5')": 1, "('2', '6')": 1, "('1', '4')": 2, "('2', '5')": 2, "('3', '6')": 2}

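My current implementation works for smaller datasets, but I am trying to scale it to very large ones (~30,000 rows). For datasets that large, this implementation is very slow, so I would like to know what a more efficient way to check each candidate against each data row would be. Is there an implementation that does not need the nested for loop?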

for row in data:  # For each line in data set
    for candidate in candidates:  # Compare each line against each candidate
        # Note: `in` on a string row is a substring test, not a token test
        if candidate[0] in row and candidate[1] in row:
            candidate = str(candidate)  # Dict keys are the tuple's string form
            if counts.get(candidate):  # +1 if already in dict
                counts[candidate] += 1
            else:  # Start count if not in dict
                counts[candidate] = 1

return counts

How can I speed up iteration over the nested for loops?
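For reference, one direction that might avoid comparing every row against every candidate is to generate only the pairs that actually occur in each row and count those. The following is a minimal sketch, not a confirmed fix. It assumes each row is a space-separated string, that frequents is the list of frequent items used to build candidates, and that tuple keys (rather than their str() form) are acceptable. The function name count_pairs is hypothetical. It also matches whole tokens rather than substrings, so it can give different counts from the loop above once items have more than one character (for example '1' versus '10').

```python
from collections import Counter
from itertools import combinations


def count_pairs(data, frequents):
    """Count how many rows contain both items of each frequent pair."""
    frequent_set = set(frequents)
    counts = Counter()
    for row in data:
        # Split the row once, keep only frequent items, and sort them so
        # every pair is produced in one canonical (lexicographic) order.
        items = sorted(set(row.split()) & frequent_set)
        # Only pairs that are actually present in this row are generated,
        # so the work per row depends on the row length, not on the
        # total number of candidates.
        counts.update(combinations(items, 2))
    return counts


data = ['1 2 3', '2 3 4', '3 4 5', '4 5 6', '1 3 5', '2 4 6',
        '1 3 4', '2 4 5', '3 5 6', '1 2 4', '2 3 5', '3 4 6']
frequents = ['1', '2', '3', '4', '5', '6']  # assumed frequent items

counts = count_pairs(data, frequents)
print(dict(counts))
```

With rows of three items, each row yields at most three pairs, so 30,000 rows would mean roughly 90,000 counter updates regardless of how many candidates exist. Pairs that never occur, such as ('1', '6') here, simply do not appear in the result.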

0 answers: