What is an efficient way to split an array into training and test sets in Julia?

Time: 2016-05-04 19:48:56

Tags: arrays optimization machine-learning julia

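I am running a machine learning algorithm in Julia on a machine with limited spare memory. I have noticed a fairly large bottleneck in the code I am using from a repository: splitting the array (randomly) appears to take longer than reading the file from disk, which suggests the code is inefficient. Any tips for speeding up this function would be greatly appreciated. The original function can be found here. Since it is a short function, I will also post it below.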

# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding a rating to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

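As stated above, any way to speed this up would be greatly appreciated. I will also note that I do not strictly need to keep the ability to remove duplicates, but it would be a nice feature. If this is already implemented in a Julia library, I would be glad to know about it. Bonus points for any solution that takes advantage of Julia's parallel capabilities!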

1 Answer:

Answer 0 (score: 3)

This is the most memory-efficient code I could come up with.

using Random  # shuffle! lives in the Random standard library on Julia 0.7 and later

function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
  N = length(ratings)
  splitindex = round(Int, target_percentage * N)
  shuffle!(ratings) # This shuffles in place, which avoids allocating another array (it also reorders the caller's array)
  # view (called sub before Julia 0.5) makes subarrays instead of copying the original array
  return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)
end
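A quick way to check that the split behaves as described is to run it on some synthetic data. The sketch below is only an illustration: the `Rating` struct and its fields are a hypothetical stand-in for the repository's type, and `view` is used because `sub` no longer exists on current Julia versions.

```julia
using Random

# Hypothetical stand-in for the repository's Rating type (field names are assumed).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Same idea as the function above: shuffle in place, then return two views.
function splitratings(ratings::Vector{Rating}, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Int, target_percentage * N)
    shuffle!(ratings)  # reorders the caller's array, no copy
    return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)
end

# 1000 synthetic ratings: 100 users times 10 items.
ratings = [Rating(u, i, 1.0) for u in 1:100 for i in 1:10]
training_set, test_set = splitratings(ratings)

println(length(training_set), " ", length(test_set))  # prints "900 100"
println(parent(test_set) === ratings)                 # prints "true": the view shares memory with ratings
```

Because both results are views into `ratings`, shuffling or modifying `ratings` afterwards also changes what the two sets contain.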

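Julia's file IO is very slow, however, and that is now the bottleneck. This algorithm takes about 20 seconds to run on an array of 1.7 million elements, so I would say it is fairly performant.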
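Note that the view-based split does not keep the property described in the question (every user and item in the test set also appears in the training set). If that property matters, one possible middle ground is to keep the original loop but shuffle in place and give the sets a concrete element type, since a bare `Set()` is a `Set{Any}`. This is an untimed sketch under assumptions: the `Rating` struct, its `Int` fields, and the name `split_ratings_typed` are hypothetical and not taken from the repository.

```julia
using Random

# Hypothetical stand-in for the repository's Rating type (field names and types are assumed).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Hypothetical variant of split_ratings from the question.
function split_ratings_typed(ratings::Vector{Rating}, target_percentage=0.10)
    max_test = target_percentage * length(ratings)
    seen_users = Set{Int}()  # concrete element type instead of Set{Any}
    seen_items = Set{Int}()
    training_set = Rating[]
    test_set = Rating[]
    sizehint!(training_set, length(ratings))  # reserve capacity up front
    shuffle!(ratings)  # in place: reorders the caller's array instead of copying it
    for rating in ratings
        if rating.user in seen_users && rating.item in seen_items && length(test_set) < max_test
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

ratings = [Rating(u, i, 1.0) for u in 1:50 for i in 1:20]
training_set, test_set = split_ratings_typed(ratings)

# Every user in the test set was first seen in a rating that went to the training set.
println(issubset(Set(r.user for r in test_set), Set(r.user for r in training_set)))  # prints "true"
```

Unlike the view-based version, this one still copies each rating into a new array, so it trades some memory for the guarantee about users and items.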