Efficiently constructing a sparse biadjacency matrix in NumPy

Asked: 2014-11-26 23:49:46

Tags: python performance numpy graph

I'm trying to load this CSV file into a sparse NumPy matrix that represents the biadjacency matrix of this user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101

Here's a sample:

603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww

There are 876,961 lines (one per user), 15,122 subreddits, and 8,495,597 user-to-subreddit associations in total, so the target matrix is 15,122 × 876,961 with 8,495,597 nonzero entries.

Here's the code I have now, which takes 20 minutes to run on my MacBook Pro:

import numpy as np
from scipy.sparse import csr_matrix 

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        entry_count += len(reddits)
        for r in reddits: all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

# Preallocate COO-style index arrays, one slot per (user, subreddit) pair
rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))
i = 0
user_idx = 0
for row in row_list:
    # for each user, find the indices of their subreddits within reddits_list
    for reddit_idx in np.nonzero(np.in1d(reddits_list, row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i += 1
    user_idx += 1
adj = csr_matrix((data, (rows, cols)), shape=(len(reddits_list), len(row_list)))

It seems hard to believe that this is as fast as it can go... Loading the 82MB file into a list of lists takes 5 seconds, but building the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format I could convert this CSV into (in under 20 minutes) that would then import faster? Is there something obviously expensive in what I'm doing here? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time; neither is any faster.
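On the file-format question: once the matrix has been built once, SciPy can persist it in its own compressed format, so later runs skip the CSV parse entirely. A minimal sketch, assuming a SciPy version that provides save_npz/load_npz (the filename is made up):

from scipy import sparse

# One-time conversion: save the assembled csr_matrix to disk.
sparse.save_npz("reddit_biadjacency.npz", adj)

# Later runs reload it directly, in seconds rather than minutes.
adj = sparse.load_npz("reddit_biadjacency.npz")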

2 answers:

Answer 0 (score: 2):

Couldn't sleep, tried one last thing... and in the end I got it down to 10 seconds this way:

import numpy as np
from scipy.sparse import csr_matrix 

user_ids = []
subreddit_ids = []
subreddits = {}  # subreddit name -> row index, assigned on first sight
i = 0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]:
            if sr not in subreddits:
                subreddits[sr] = len(subreddits)
            user_ids.append(i)
            subreddit_ids.append(subreddits[sr])
        i += 1

adj = csr_matrix(
    (np.ones((len(user_ids),)), (np.array(subreddit_ids), np.array(user_ids))),
    shape=(len(subreddits), i))
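This is fast because each subreddit becomes a single dict lookup instead of an np.in1d scan over all 15,122 names per row. A quick sanity check one might run afterwards (hypothetical, assuming adj and subreddits from the snippet above are in scope):

# Row k of adj corresponds to the name with subreddits[name] == k.
print(adj.nnz)                            # should be 8,495,597 associations
print(adj[subreddits["politics"]].sum())  # users who posted in politics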

Answer 1 (score: 1):

For a start, you can replace the inner for loop with:

reddit_idx = np.nonzero(np.in1d(reddits_list,row))[0]
sl = slice(i,i+len(reddit_idx))
cols[sl] = user_idx
rows[sl] = reddit_idx
i = sl.stop

Using nonzero(in1d()) to find the matches looks good, but I haven't explored alternatives. An alternative to assignment via slices is to extend lists, but that is probably slower, especially with many rows.

Constructing rows and cols is by far the slowest part; the call to csr_matrix is minor by comparison.

Since there are far more rows (users) than subreddits, it might be worth collecting, for each subreddit, a list of user ids. You are already collecting the subreddits in a set; you could instead collect them in a defaultdict, and build the matrix from that. When tested on your 3-line sample repeated 100,000 times, it is noticeably faster.

from collections import defaultdict
from scipy import sparse

red_dict = defaultdict(list)
user_idx = 0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        for r in reddits:
            red_dict[r].append(user_idx)
        user_idx += 1

print('done 2nd')
x = list(red_dict.values())  # one list of user ids per subreddit
adj1 = sparse.lil_matrix((len(x), user_idx), dtype=int)
for i, j in enumerate(x):
    adj1[i, j] = 1  # set a whole row's entries at once via fancy indexing
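If the row-by-row lil_matrix assignment is still slow, the same defaultdict can feed a coo_matrix in one shot. A sketch under the variables above (red_dict, user_idx); adj2 is a made-up name:

import numpy as np
from scipy.sparse import coo_matrix

# Flatten the per-subreddit user lists into parallel index arrays:
# row index r is repeated once for each user who posted in subreddit r.
lists = list(red_dict.values())
rows = np.repeat(np.arange(len(lists)), [len(u) for u in lists])
cols = np.concatenate(lists)
adj2 = coo_matrix((np.ones(len(cols)), (rows, cols)),
                  shape=(len(lists), user_idx)).tocsr()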