从Pandas数据框创建矩阵以显示连通性

时间:2016-05-11 22:07:43

标签: python matrix

我在pandas数据帧中以这种格式获取数据:

Customer_ID  Location_ID
Alpha             A
Alpha             B
Alpha             C
Beta              A
Beta              B
Beta              D

我想研究客户的移动模式。我的目标是确定客户最常光顾的位置集群。我认为以下矩阵可以提供这样的信息:

   A  B  C  D
A  0  2  1  1
B  2  0  1  1
C  1  1  0  0
D  1  1  0  0

我如何在Python中这样做?

我的数据集非常庞大(成千上万的客户和大约一百个地点)。

2 个答案:

答案 0 :(得分:1)

以下是一种考虑到访问次数的方法(例如,如果客户X两次访问LocA和LocB,他将向最终矩阵中的相应位置提供2。)

点子:

  1. 对于每个位置,计算客户的访问次数。
  2. 对于每个位置对,找到访问过这两个位置对的每个客户的最小访问次数总和。
  3. 使用unstack并清理。
  4. Counter在这里播放得很好,因为计数器支持许多自然算术运算,例如addmax等。

    import pandas as pd
    from collections import Counter
    from itertools import product
    
    df = pd.DataFrame({
        'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
        'Location_ID': ['A', 'B', 'C', 'A', 'B'],
        })
    
    
    ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}
    
    
    # In [7]: q.ctrs
    # Out[7]:
    # {'A': Counter({'Alpha': 1, 'Beta': 1}),
    #  'B': Counter({'Alpha': 1, 'Beta': 1}),
    #  'C': Counter({'Alpha': 1})}
    
    
    ctrs = list(ctrs.items())
    overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
        for i, (loc1, ctr1) in enumerate(ctrs, start=1)
        for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
    overlaps += [(l2, l1, c) for l1, l2, c in overlaps]
    
    
    df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
    df2 = df2.set_index(['Loc1', 'Loc2'])
    df2 = df2.unstack().fillna(0).astype(int)
    
    
    #      Count
    # Loc2     A  B  C
    # Loc1
    # A        0  2  1
    # B        2  0  1
    # C        1  1  0
    

    如果您想忽略多重性,请将Counter(gp.Customer_ID)替换为Counter(set(gp.Customer_ID))

答案 1 :(得分:1)

我确信这是一种更优雅的方式,但这是我在飞行中提出的解决方案。基本上,您为每个客户构建一个邻接列表,然后相应地更新邻接矩阵:

import pandas as pd

#I'm assuming you can get your data into a pandas data frame:
data = {'Customer_ID':[1,1,1,2,2],'Location':['A','B','C','A','B']}
df = pd.DataFrame(data)

#Initialize an empty matrix
matrix_size = len(df.groupby('Location'))
matrix = [[0 for col in range(matrix_size)] for row in range(matrix_size)]

#To make life easier, I made a map to go from locations 
#to row/col positions in the matrix
location_set = list(set(df['Location'].tolist()))
location_set.sort()
location_map = dict(zip(location_set,range(len(location_set))))

#Group data by customer, and create an adjacency list (dyct) for each
#Update the matrix accordingly
for name,group in df.groupby('Customer_ID'):
    locations = set(group['Location'].tolist())
    dyct = {}
    for i in locations:
        dyct[i] = list(locations.difference(i))

    #Loop through the adjacency list and update matrix
    for node, edges in dyct.items(): 
        for edge in edges:
            matrix[location_map[edge]][location_map[node]] +=1