Question

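My question is broad, but I will try to make my intent as clear as possible so that people can offer suggestions. I am trying to optimize a process I run. In general, I feed a data frame of values to a function and generate a prediction from operations on specific columns; basically a custom function used with sapply (code below). What I am doing is too large to provide a meaningful reproducible example, so I will instead describe the inputs to the process. I know this limits how useful the answers can be, but I am interested in any ideas for reducing the time it takes to compute my predictions. Currently, generating a single prediction (running the function on one row of the data frame) takes about 10 seconds.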

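library(Matrix)  # dfm is a sparse matrix from the Matrix package

mean_rating <- function(df) {
  user  <- df$user
  movie <- df$movie

  # Translate the user and movie IDs into row/column positions in dfm
  u_row <- which(U_lookup == user)[1]
  m_row <- which(M_lookup == movie)[1]

  # Row indices of the 100 nearest neighbours of this user
  knn_match  <- knn_txt[u_row, 1:100]
  knn_match1 <- as.numeric(unlist(knn_match))

  # Ratings given by those neighbours (one row per neighbour)
  dfm_test <- dfm[knn_match1, ]

  # Column of dfm associated with the queried movie
  dfm_mov <- dfm_test[, m_row]

  # Prediction: mean rating of the movie across the neighbours
  mean(dfm_mov)
}

test <- sapply(1:nrow(probe_test), function(x) mean_rating(probe_test[x, ]))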

Inputs: dfm is my main data matrix, with users in rows and movies in columns. It is very sparse.

> str(dfm)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:99072112] 378 1137 1755 1893 2359 3156 3423 4380 5103 6762 ...
  ..@ j       : int [1:99072112] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ Dim     : int [1:2] 480189 17770
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:99072112] 4 5 4 1 4 5 4 5 3 3 ...
  ..@ factors : list()
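
One thing that may be worth checking, since dfm is stored in triplet form (dgTMatrix): subsetting that representation tends to be slower than subsetting a compressed format. The following is only a sketch on a toy matrix, not a measured result for the real data. `toy_dfm`, `toy_csc` and `neighbours` are made-up names, and the conversion is assumed to fit in memory. It also selects the neighbour rows and the movie column in a single call, so the intermediate neighbours-by-all-movies submatrix is never built.

```r
library(Matrix)

# Toy stand-in for dfm: 5 users x 4 movies, stored in triplet form
# like the real matrix (class dgTMatrix).
toy_dfm <- as(
  sparseMatrix(
    i = c(1, 2, 2, 4, 5),
    j = c(1, 1, 3, 2, 4),
    x = c(4, 5, 3, 1, 2),
    dims = c(5, 4)
  ),
  "TsparseMatrix"
)

# Convert once, outside the prediction function, to compressed
# column form (dgCMatrix).
toy_csc <- as(toy_dfm, "CsparseMatrix")

# Hypothetical neighbour rows for one user, and the queried movie column.
neighbours <- c(1, 2, 4)
m_col <- 1

# Index rows and column together instead of dfm[rows, ][, col].
ratings <- toy_csc[neighbours, m_col]

# Same quantity the original function computes (zeros for missing
# ratings are included in the mean, as in the original code).
mean_rating_toy <- mean(ratings)
```

Whether this helps on the full 480189 x 17770 matrix would need to be timed, for example with `system.time()` on a few rows of probe_test.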

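probe_test is the test set I want to evaluate. The actual probe set contains about 1.4 million rows, but I am first trying a subset to optimize the time. It is what gets passed to my function.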

> str(probe_test)
'data.frame':   6 obs. of  6 variables:
 $ X          : int  1 2 3 4 5 6
 $ movie      : int  1 1 1 1 1 1
 $ user       : int  1027056 1059319 1149588 1283744 1394012 1406595
 $ Rating     : int  3 3 4 3 5 4
 $ Rating_Date: Factor w/ 1929 levels "2000-01-06","2000-01-08",..: 1901 1847 1911 1312 1917 1803
 $ Indicator  : int  1 1 1 1 1 1

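U_lookup is a lookup I use to convert between a user ID and the row that user occupies, because the user IDs are lost when the data is converted to a sparse matrix.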

> str(U_lookup)
'data.frame':   480189 obs. of  1 variable:
 $ x: int  10 100000 1000004 1000027 1000033 1000035 1000038 1000051 1000053 1000057 ...

M_lookup is a lookup I use to convert between a movie ID and the matrix column that movie occupies, for the same reason as above.

> str(M_lookup)
'data.frame':   17770 obs. of  1 variable:
 $ x: int  1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
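
Because both lookups are single-column data frames, `which(U_lookup == user)` compares against all 480189 entries on every call. A possible alternative, sketched here on toy data, is to resolve every ID in one vectorized pass with base R's `match()`, which returns the position of the first match just as `which(...)[1]` does. `U_toy`, `M_toy` and `probe_toy` are made-up stand-ins for the real objects.

```r
# Toy stand-ins for U_lookup, M_lookup and probe_test; the real lookups
# have a single column named x, as shown in the str() output.
U_toy <- data.frame(x = c(10L, 100000L, 1000004L, 1000027L))
M_toy <- data.frame(x = c(1L, 10L, 100L, 1000L))
probe_toy <- data.frame(
  movie = c(1L, 100L, 1L),
  user  = c(1000004L, 10L, 1000027L)
)

# Resolve all user rows and movie columns at once, before any
# per-row prediction work starts.
u_rows <- match(probe_toy$user, U_toy$x)
m_cols <- match(probe_toy$movie, M_toy$x)

# IDs that are missing from a lookup come back as NA and would need
# handling before they are used as indices.
any_missing <- anyNA(u_rows) || anyNA(m_cols)
```

The resulting `u_rows` and `m_cols` vectors could then be passed to the prediction function by position instead of looking each ID up again.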

knn_txt contains the 100 nearest neighbours of every row of dfm.

> str(knn_txt)
'data.frame':   480189 obs. of  200 variables:
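
Extracting one row from a wide data frame and then calling `unlist()` and `as.numeric()` on it is relatively expensive when repeated 1.4 million times. A possible alternative is to convert the neighbour columns to an integer matrix once, up front. This is a sketch that assumes the first 100 columns of knn_txt hold numeric row indices; the `str()` output above is truncated, so the column types are not confirmed. `knn_toy` and `knn_mat` are made-up names, and the toy uses 2 neighbour columns instead of 100.

```r
# Toy stand-in for knn_txt: the first two columns are neighbour row
# indices, the third is something else (e.g. a distance).
knn_toy <- data.frame(
  V1 = c(2, 3, 1),
  V2 = c(3, 1, 2),
  V3 = c(0.1, 0.2, 0.3)
)

# One-time conversion of the neighbour columns to an integer matrix.
knn_mat <- as.matrix(knn_toy[, 1:2])
storage.mode(knn_mat) <- "integer"

# Per-user lookup is now a plain matrix row extraction, with no
# unlist() or as.numeric() needed.
neighbours_user1 <- knn_mat[1, ]
```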

Thank you for any suggestions you can give me.

How can I optimize this process?
