
Time: 2019-03-22 23:12:12

Tags: python pandas scipy

This question follows up on one I asked earlier: Python - How to speed up cosine similarity with counting arrays


import numpy as np
import pandas as pd
import networkx as nx
from scipy import spatial

def compute_other(user_1, user_2):
    uniq = list(set(user_1[0] + user_2[0]))

    duniq = {k:0 for k in uniq}    

    u1 = create_vector(duniq, list(user_1[0]))
    u2 = create_vector(duniq, list(user_2[0]))

    return 1 - spatial.distance.cosine(u1, u2)

distances = spatial.distance.cdist(df[['ARTIST']], df[['ARTIST']], metric=compute_other)

idx_to_remove = np.triu_indices(len(distances))
distances[idx_to_remove] = 0

df_dist = pd.DataFrame(distances, index = df.index, columns = df.index)
edges = df_dist.stack().to_dict()
edges = {k: v for k, v in edges.items() if v > 0}

print('NET inference')
net = nx.Graph()

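The first thing I notice is that I compute the full matrix and then throw half of it away, so it would be nice to compute only the half I need (that alone would be a 2x speedup).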
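As a rough illustration of computing only the half that is needed, here is a minimal sketch in plain Python. It is not the code from the question: the helper `cosine_similarity` and the sample `artists` rows are hypothetical, and it assumes every row is a non-empty sequence of hashable ids.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine_similarity(items_a, items_b):
    """Cosine similarity between the count vectors of two item sequences."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    # Only items present in both sequences contribute to the dot product.
    dot = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical sample rows in the same shape as the ARTIST column.
artists = [(5634, 5634), (2673, 2673, 5634), (723, 723)]

# combinations() yields each unordered pair (i, j) with i < j exactly once,
# so the diagonal and the mirrored half are never computed.
pair_similarity = {
    (i, j): cosine_similarity(artists[i], artists[j])
    for i, j in combinations(range(len(artists)), 2)
}
```

For n rows this evaluates n*(n-1)/2 pairs instead of n*n, which is where the roughly 2x saving would come from.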


"(75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 75751, 15053)"
"(55852, 55852, 17727, 17727, 2182)"
"(11446, 11446, 11446, 11446, 11446, 11446, 11446, 11446)"
"(22873, 22873, 22873, 22873)"
"(5634, 5634)"
"(311, 18672)"
"(1740, 1740, 1740, 1740, 1746, 15048, 15048, 1740)"
"(1788, 1983, 1788, 1748, 723, 100744, 723, 226, 1583, 12188, 51325, 1748, 75401, 1171)"
"(59173, 59173)"
"(2673, 2673, 2673, 2673, 2673, 2673, 2673, 5634, 5634, 5634)"
"(2251, 4229, 14207, 1744, 16366, 1218)"
"(19703, 1171, 1171)"
"(1243, 8249, 2061, 1243, 13343, 9868, 574509, 892, 1080, 1243, 3868, 2061, 4655)"
"(3868, 60112, 11084)"
"(15869, 15869, 15869, 15869)"
"(4067, 4067, 4067, 4067, 4067, 4067)"
"(1171, 1171, 1171, 1171)"
"(1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1245, 1195, 1193, 1193, 1193, 1193, 1193, 1193)"
"(723, 723)"  


import ast
import pandas as pd

df = pd.read_csv('Stack.csv')
df['ARTIST'] = df['ARTIST'].apply(lambda x : ast.literal_eval(x))



from collections import Counter

def create_vector(duniq, l):
    dx = duniq.copy()            # Start from the all-zero template
    dx.update(Counter(l))        # Overwrite with the actual counts
    return list(dx.values())     # Return the counts as a list

1 answer:

Answer 0 (score: 1)

I tried to tinker with this, but I get compile errors on these two lines:  u1 = create_vector(duniq, list(user_1[0]))  u2 = create_vector(duniq, list(user_2[0]))


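I suspect that using a mask on your df would improve performance by eliminating the overwrite done by "distances[idx_to_remove] = 0", and it should also reduce the number of iterations in "edges = {k: v for k, v in edges.items() if v > 0}".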
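To make the mask idea concrete, here is a minimal sketch, assuming NumPy and a small made-up matrix. The `distances` values and the `labels` list are hypothetical stand-ins for the cdist() result and df.index; I have not run this against the real data.

```python
import numpy as np

# Hypothetical 3x3 similarity matrix standing in for the cdist() result.
distances = np.array([
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.7],
    [0.0, 0.7, 1.0],
])
labels = ["a", "b", "c"]  # hypothetical stand-in for df.index

# Boolean mask: strictly below the diagonal AND greater than zero.
mask = np.tril(np.ones(distances.shape, dtype=bool), k=-1) & (distances > 0)

# np.nonzero returns the row and column positions where the mask is True.
rows, cols = np.nonzero(mask)
edges = {
    (labels[r], labels[c]): float(distances[r, c])
    for r, c in zip(rows, cols)
}
```

The matrix is never modified, and the dictionary is built only from the entries that survive the mask, so there is no second filtering pass over all n*n items.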


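Hi Guido. Sorry this took so long, but it was a tough one to crack! After trying several different approaches (some of which ran even slower), I came up with the following to replace your create_vector() and compute_other() functions: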

def compute_other2(user_1, user_2):
    uniq = set(user_1[0] + user_2[0])  # unique items appearing in either user_1 or user_2
    u1 = [user_1[0].count(ui) for ui in uniq]
    u2 = [user_2[0].count(ui) for ui in uniq]
    return 1 - spatial.distance.cosine(u1, u2)

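I got a 20% performance improvement, which is less than I had hoped for, but it's something. Note: I am still running your code with "spatial.distance.cdist". I did see that you gained 50% by switching to "spatial.distance.pdist". I'm not sure how you used it (I suspect vector math), and that is beyond me. Perhaps you can use this new compute_other2() function with spatial.distance.pdist and gain a bit more.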
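For what it's worth, here is a minimal sketch of one way the counting logic above might be paired with spatial.distance.pdist. This is an assumption on my part, not a description of how the 50% gain was obtained. Since pdist expects a numeric 2-D array, the sketch passes row indices and looks the tuples up inside the metric; `artists`, `idx` and `similarity_by_index` are hypothetical names.

```python
import numpy as np
from scipy import spatial
from scipy.spatial.distance import pdist, squareform

# Hypothetical sample rows in the same shape as the ARTIST column.
artists = [(1, 1, 2), (1, 3), (4, 4)]

def similarity_by_index(i, j):
    """Same counting logic as compute_other2, but keyed by row index."""
    a = artists[int(i[0])]
    b = artists[int(j[0])]
    uniq = set(a + b)
    u1 = [a.count(ui) for ui in uniq]
    u2 = [b.count(ui) for ui in uniq]
    return 1 - spatial.distance.cosine(u1, u2)

# One numeric column holding the row positions 0..n-1.
idx = np.arange(len(artists), dtype=float).reshape(-1, 1)

# pdist calls the metric once per pair (i, j) with i < j and returns
# a condensed 1-D array of length n*(n-1)/2.
condensed = pdist(idx, metric=similarity_by_index)

# squareform expands the condensed array into a symmetric n x n matrix
# with zeros on the diagonal.
full = squareform(condensed)
```

Because pdist never visits the diagonal or the mirrored half, the later step that zeroes out half of the matrix would no longer be needed.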
