I'm running a machine-learning algorithm in Julia on a machine with limited spare memory. Anyhow, I've noticed a sizeable bottleneck in the code I'm using from a repository: splitting an array (randomly) seems to take longer than reading the file from disk, which seems to highlight an inefficiency in the code. As I said before, any tricks to speed this function up would be much appreciated. The original function can be found here. Since it's a short function, I'll also post it below.
# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user appearing in some rating in the original set of ratings
# is also in the training set, and any item appearing in some rating in the
# original set is also in the training set. We preserve this property by
# iterating through the ratings in random order, adding a rating to the test
# set only if we haven't already hit target_percentage and we've already seen
# both the user and the item in some other rating.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = Rating[]
    test_set = Rating[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) &&
           length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end
As stated before, any way I can push this data around faster would be much appreciated. I'll also note that I don't really need the ability to remove duplicates, but that would be a nice feature. If this is already implemented in a Julia library, I'd be happy to know about it. Bonus points for any solution that takes advantage of Julia's parallel capabilities!
Answer 0 (score: 3)
This is the most memory-efficient code I could come up with.
function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Integer, target_percentage * N)
    shuffle!(ratings) # This shuffles in place, which avoids allocating another array!
    # This returns (training, test) as subarrays instead of copying the original array!
    return sub(ratings, splitindex+1:N), sub(ratings, 1:splitindex)
end
However, Julia's file IO is terribly slow and has now become the bottleneck. This algorithm takes about 20 seconds to run on an array of 1.7 million elements, so I'd say it's fairly performant.
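For readers on current Julia: `sub` dates from the Julia 0.4 era and was later removed in favor of `view`, and `shuffle!` moved into the `Random` standard library. A minimal sketch of the same idea on Julia 1.x, using a plain vector in place of the `Rating` type (like the answer above, it drops the seen-user/seen-item guarantee of the original function; the name `splitratings!` is just illustrative):

```julia
using Random  # shuffle! lives in the Random stdlib on Julia 1.x

# Modern rewrite of the answer's splitratings; works on any vector.
# The `!` signals that the input is shuffled in place.
function splitratings!(ratings::AbstractVector, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Int, target_percentage * N)
    shuffle!(ratings)  # in place: no second array is allocated
    # Views share the shuffled array's memory instead of copying it
    return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)
end

train, test = splitratings!(collect(1:1_000_000))
length(train), length(test)  # (900000, 100000)
```

Because both return values are `SubArray` views into the same backing array, the split itself allocates almost nothing; call `collect` on a view if downstream code needs an independent copy.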