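I'm trying to implement a content-based recommender for my tweet application, and I think I've managed to get it working. The problem is that my solution is so database-intensive that load times are far too long, so I've come here for help. I'll post the algorithm first and then explain it.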
require 'matrix'
require 'tf-idf-similarity'

# Returns up to five users that the given user does not follow, chosen by how
# similar their tweets are to the tweets of the accounts the user follows.
def candidates2(user)
  following = user.following.to_a + [user]   # everyone the user follows, plus the user
  rest_of_users = User.all.to_a - following  # everyone the user does not follow

  # Text of every tweet from the followed accounts, joined into one document.
  # feed is a method that returns all the tweets of that person.
  following_text = following.flat_map { |followee| followee.feed.map(&:content) }.join(' ')
  document1 = TfIdfSimilarity::Document.new(following_text)

  scores = {} # email => similarity score computed by the TfIdfSimilarity gem
  rest_of_users.each do |person|
    # Text of this person's tweets only, kept separate from following_text.
    person_text = person.feed.map(&:content).join(' ')
    document2 = TfIdfSimilarity::Document.new(person_text)

    # Score this person against the followed-accounts document.
    model = TfIdfSimilarity::TfIdfModel.new([document1, document2])
    matrix = model.similarity_matrix
    scores[person.email] = matrix[model.document_index(document1), model.document_index(document2)]
  end

  # Keep the five highest-scoring emails and return the matching users.
  top_emails = scores.sort_by { |_email, score| -score }.first(5).map(&:first)
  rest_of_users.select { |person| top_emails.include?(person.email) }
end
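The algorithm is described here, on page 6, section 3.2, "Content-based recommenders" (about 20 lines of explanation).
The main problem with my algorithm is that I have to load every user who is not being followed, then fetch all of their tweets, and only then apply the algorithm. That is extremely database-intensive, and I can't keep doing it this way. Any ideas on how to improve this?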
Answer 0: (score: 1)
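You should separate generating the recommendations from displaying them.
That is, you have a batch job that processes the tweets, generates the recommendations, and stores them in the database. This job runs periodically.
Separately, you have a web interface that queries the database for the current recommendations and displays them.
Now page load times are fast, and so are web response times. Your performance problem now shows up as how often you can run the batch job. That is an environment where latency is not an issue, and it is more easily addressed with techniques such as running jobs in parallel.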
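As a rough illustration of that split, here is a minimal sketch in plain Ruby. Everything in it is an assumption made for the example: RecommendationStore is a hypothetical in-memory stand-in for a recommendations table, generate_recommendations stands in for the periodic batch job, and overlap_score is a simple word-overlap placeholder, not the TF-IDF score from the question. In the real app the batch job would call your existing scoring code and write to the database.

```ruby
require 'set'

# Hypothetical stand-in for a recommendations table keyed by user id.
class RecommendationStore
  def initialize
    @rows = {}
  end

  # Called by the batch job: replace the stored list for one user.
  def save(user_id, recommended_ids)
    @rows[user_id] = recommended_ids.dup.freeze
  end

  # Called by the web request: one cheap lookup, no scoring at all.
  def fetch(user_id)
    @rows.fetch(user_id, [])
  end
end

# Placeholder scorer (Jaccard word overlap), standing in for the TF-IDF score.
def overlap_score(text_a, text_b)
  a = text_a.downcase.split.to_set
  b = text_b.downcase.split.to_set
  union = (a | b).size
  union.zero? ? 0.0 : (a & b).size.to_f / union
end

# Batch job: run periodically, outside of any web request.
# tweets_by_user: { user_id => [tweet text, ...] }
# following:      { user_id => [ids of followed users] }
def generate_recommendations(tweets_by_user, following, store, limit: 5)
  tweets_by_user.each_key do |user_id|
    followed = following.fetch(user_id, []) + [user_id]
    profile = followed.flat_map { |id| tweets_by_user.fetch(id, []) }.join(' ')
    candidates = tweets_by_user.keys - followed
    # max_by(n) returns the n best candidates, highest score first.
    ranked = candidates.max_by(limit) do |id|
      overlap_score(profile, tweets_by_user[id].join(' '))
    end
    store.save(user_id, ranked)
  end
end

# Example run with made-up data.
tweets = {
  1 => ['ruby on rails tips'],
  2 => ['rails performance tips'],
  3 => ['gardening with tomatoes'],
  4 => ['ruby gems and rails engines']
}
store = RecommendationStore.new
generate_recommendations(tweets, { 1 => [2] }, store)
store.fetch(1) # => [4, 3]
```

The web side only ever calls fetch, so its cost no longer depends on how many users or tweets exist. The expensive loop lives in the batch job, where you can run it less often, process only users with new tweets, or split the users across parallel workers.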