Question

我经常需要根据一列中项目的共现来生成网络图。我从这样的事情开始：

           letters
0  [b, a, e, f, c]
1        [a, c, d]
2        [c, b, j]

在下面的示例中，我希望创建一个包含所有字母对的表格，然后有一个“权重”列，该列描述每两个字母对在同一行中一起出现多少次 em>（例如，参见底部）。

我目前正在使用for循环处理它的大部分内容，我想知道是否有一种方法可以对它进行矢量化处理，因为我经常要处理大量的数据集，而这种数据集需要花费很长时间才能处理。我还担心将事情限制在内存范围内。这是我现在的代码：

import pandas as pd # Make some data df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]}) # I make a list of sets, which contain pairs of all the elements # that co-occur in the data in the same list sets = [] for lst in df['letters']: for i, a in enumerate(lst): for b in lst[i:]: if not a == b: sets.append({a, b}) # Sets now looks like: # [{'a', 'b'}, # {'b', 'e'}, # {'b', 'f'},... # Dataframe with one column containing the sets df = pd.DataFrame({'weight': sets}) # We count how many times each pair occurs together df = df['weight'].value_counts().reset_index() # Split the sets into two seperate columns split = pd.DataFrame(df['index'].values.tolist()) \ .rename(columns = lambda x: f'Node{x+1}') \ .fillna('-') # Merge the 'weight' column back onto the dataframe df = pd.concat([df['weight'], split], axis = 1) print(df.head) # Output: weight Node1 Node2 0 2 c b 1 2 a c 2 1 f e 3 1 d c 4 1 j b

Answer 1

使用稀疏入射矩阵的numpy / scipy解决方案：

from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    SZS = np.fromiter(chain((0,),map(len,L)),int,len(L)+1).cumsum()
    unq,idx = np.unique(np.concatenate(L),return_inverse=True)
    S = sparse.csr_matrix((np.ones(idx.size,int),idx,SZS),(len(L),len(unq)))
    SS = (S.T@S).tocoo()
    idx = (SS.col>SS.row).nonzero()
    return unq[SS.row[idx]],unq[SS.col[idx]],SS.data[idx] # left, right, count


from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i),r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    Counter((min(nodes), max(nodes))
            for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs

ltrs = np.array([*ltrs])

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0,np.random.randint(-2,2,26)).astype(bool).tolist()] for _ in range(n)]
        yield n,L

r = B.run()
r.plot()

我们看到这里介绍的方法（pp）具有典型的numpy常量开销，但是从大约100个子列表开始，它就开始赢了。

OP示例：

import pandas as pd

df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))

打印：

   left right  count
0     a     b      1
1     a     c      2
2     b     c      2
3     c     d      1
4     a     d      1
5     c     e      1
6     a     e      1
7     b     e      1
8     c     f      1
9     e     f      1
10    a     f      1
11    b     f      1
12    b     j      1
13    c     j      1

Answer 2

注意：

根据其他答案的建议，使用collections.Counter进行计数。由于它的行为类似于dict，因此需要可散列的类型。 {a,b}不可散列，因为它是一个集合。用元组代替它可以解决哈希性问题，但会引入可能的重复项（例如('a', 'b')和('b', 'a')）。要解决此问题，只需对元组进行排序。

由于sorted返回一个list，因此我们需要将其变成一个元组：tuple(sorted((a,b)))。有点麻烦，但与Counter结合使用很方便。

快速简便的加速：理解而不是循环

重新排列后，可以用以下理解替换嵌套循环：

sets = [ sorted((a,b)) for lst in df['letters'] for i,a in enumerate(lst) for b in lst[i:] if not a == b ]

Python已对理解执行进行了优化，因此这已经带来了一定的提速。

奖金：如果将其与Counter结合使用，甚至不需要将结果作为列表使用，而是可以使用 generator表达式（几乎没有多余的内存被使用）所有对的存储）：

Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b ) # note the lack of [ ] around the comprehension

评估：更快的方法是什么？

通常，在处理性能时，最终的答案必须来自测试不同的方法并选择最佳方法。在这里，我比较了@yatu基于IMO（非常优雅且易读）itertools的方法，原始的嵌套方法和理解方法。所有测试都在相同的样本数据上运行，这些数据是随机生成的，类似于给定的示例。

from timeit import timeit

setup = '''
import numpy as np
import random
from collections import Counter
from itertools import combinations, chain
random.seed(42)
np.random.seed(42)

DF_SIZE = 50000 # make it big
MAX_LEN = 6
list_lengths = np.random.randint(1, 7, DF_SIZE)

letters = 'abcdefghijklmnopqrstuvwxyz'

lists = [ random.sample(letters, ln) for ln in list_lengths ] # roughly equivalent to df.letters.tolist()
'''

#################

comprehension = '''Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b )'''
itertools = '''Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in lists))'''
original_for_loop = '''
sets = []
for lst in lists:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append(tuple(sorted((a, b))))
Counter(sets)
'''

print(f'Comprehension: {timeit(setup=setup, stmt=comprehension, number=10)}')
print(f'itertools: {timeit(setup=setup, stmt=itertools, number=10)}')
print(f'nested for: {timeit(setup=setup, stmt=original_for_loop, number=10)}')

在我的计算机（python 3.7）上运行上面的代码会打印：

Comprehension: 1.6664735930098686
itertools: 0.5829475829959847
nested for: 1.751666523006861

因此，两种建议的方法都比嵌套的for循环有所改进，但是在这种情况下itertools确实更快。

Answer 3

为了提高性能，您可以使用itertooos.combinations来从内部列表中获取所有长度的2组合，并使用Counter来计算扁平化列表中的对。

请注意，除了从每个子列表中获取所有组合之外，排序也是必要的步骤，因为它将确保所有成对的元组以相同的顺序出现：

from itertools import combinations, chain
from collections import Counter

l = df.letters.tolist()
t = chain.from_iterable(combinations(sorted(i), r=2) for i in l)

print(Counter(t))

Counter({('a', 'b'): 1,
         ('a', 'c'): 2,
         ('a', 'e'): 1,
         ('a', 'f'): 1,
         ('b', 'c'): 2,
         ('b', 'e'): 1,
         ('b', 'f'): 1,
         ('c', 'e'): 1,
         ('c', 'f'): 1,
         ('e', 'f'): 1,
         ('a', 'd'): 1,
         ('c', 'd'): 1,
         ('b', 'j'): 1,
         ('c', 'j'): 1})

有没有一种方法可以对熊猫/ numpy中的物品共现进行矢量化处理？

3 个答案:

注意：

快速简便的加速：理解而不是循环

评估：更快的方法是什么？