Question

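I have a CSV file with three columns (user_id, item_id, score). I want to build a matrix with user_id as rows, item_id as columns, and the score as the corresponding entry, so that I can run machine learning analyses on it. The file has 180K rows, and the code below takes about 2.5 minutes. How can I make it faster? There are roughly 1K unique user_ids and 9K unique item_ids. user_id and item_id are long integers, and scores range from 1 to 5.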

location <- "data.csv"
data <- read.csv(location, header = TRUE)

user_id <- data[,1]
item_id <- data[,2]
score <- data[,3]


unique_user_id <- unique(unlist(user_id))
unique_item_id <- unique(unlist(item_id))

user_item <- matrix(0,nrow=length(unique_user_id), ncol=length(unique_item_id))

for (i in 1:nrow(data)){
  row <- match(data[i,1],unique_user_id)
  col <- match(data[i,2],unique_item_id)
  user_item[row,col] <- data[i,3]
}

Sample input:

user_id      item_id    score
 10000001     101          1
 10000001     102          2
 10000002     103          2
 10000001     104          3

Sample output:

       1      2    3    4      
  1    1      2         3
  2                2
  3

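Of course, I also need two tables that map the output row and column indices back to the original user IDs and item IDs. Any better representation is appreciated, but I do need to store the (user, item, score) data in matrix form as described above.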

Answer 1

You can use the reshape function from the stats package.

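# Toy long-format data: 25 IDs x 4 factor levels, with random responses from 1 to 5
tab <- data.frame(ID = rep(1:25, 4), Fact = rep(c('A','B','C','D'), each = 25),
                  Resp = sample(1:5, size = 100, replace = TRUE))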

> head(tab)
  ID Fact Resp
1  1    A    1
2  2    A    5
3  3    A    2
4  4    A    1
5  5    A    2
6  6    A    4

res<-reshape(tab,idvar='ID',timevar='Fact',direction='wide')

> head(res)
  ID Resp.A Resp.B Resp.C Resp.D
1  1      1      3      1      4
2  2      5      4      4      3
3  3      2      2      4      2
4  4      1      4      5      5
5  5      2      2      4      4
6  6      4      2      5      2
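
Applied to the question's columns, the same reshape call might look like the sketch below, which uses the four sample rows from the question in place of data.csv. The block also shows a vectorized alternative that is not part of this answer: calling match once on the whole columns and filling the matrix with a two-column index, which avoids the row-by-row loop. The names wide, from_reshape, rows and cols are illustrative only, and no timing is claimed for either approach on the full 180K rows.

```r
# Sample rows from the question (stand-in for read.csv("data.csv"))
data <- data.frame(
  user_id = c(10000001, 10000001, 10000002, 10000001),
  item_id = c(101, 102, 103, 104),
  score   = c(1, 2, 2, 3)
)

# Approach from the answer: long -> wide, one row per user, one column per item
wide <- reshape(data, idvar = "user_id", timevar = "item_id", direction = "wide")
from_reshape <- as.matrix(wide[, -1])     # drop the user_id column
from_reshape[is.na(from_reshape)] <- 0    # missing (user, item) pairs become 0

# Alternative (assumption, not from the answer): vectorized match + matrix indexing
unique_user_id <- unique(data$user_id)    # lookup table: row index -> user_id
unique_item_id <- unique(data$item_id)    # lookup table: column index -> item_id
rows <- match(data$user_id, unique_user_id)
cols <- match(data$item_id, unique_item_id)

user_item <- matrix(0, nrow = length(unique_user_id), ncol = length(unique_item_id))
user_item[cbind(rows, cols)] <- data$score  # assign all scores in one step

print(user_item)
```

Here unique_user_id and unique_item_id double as the two lookup tables the question asks for: unique_user_id[i] is the original ID for row i, and unique_item_id[j] is the original ID for column j.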
