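Performing a for loop on a matrix instead of a data frame

Time: 2017-09-12 02:53:07

Tags: r matrix dataframe data.table sparse-matrix

I am running a fairly complex linear regression that involves using a for loop to conditionally create dummy variables in new columns. So far I have been doing this across several data frames, converting them to matrices, then to sparse matrices, and then joining them; however, I have hit the limits of my machine. Apologies if this is confusing; I have simplified the process as much as I can.

Edit - added an all-numeric example to the original question.

Here is the source data with all numeric values: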

df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002) 
df$X3 <- c(10002, 10002, 10002, 10001, 10001, 10001, 10003, 10003, 10003) 
df$X4 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
names(df) <- c("response", "group_1", "group_2", "exclude")

What it looks like:

  response group_1 group_2 exclude
1        5   10001   10002   10001
2        1   10001   10002   10001
3        2   10001   10002   10001
4        0   10003   10001   10003
5        4   10003   10001   10003
6        8   10003   10001   10003
7        7   10002   10003   10002
8        6   10002   10003   10002
9        0   10002   10003   10002

Original source data (see the edit above for the numeric version):

df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green") 
df$X3 <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow") 
df$X4 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
names(df) <- c("response", "group_1", "group_2", "exclude") 

Here is a simplified version of the data:

  response group_1 group_2 exclude
1        5    blue   green    blue
2        1    blue   green    blue
3        2    blue   green    blue
4        0  yellow    blue  yellow
5        4  yellow    blue  yellow
6        8  yellow    blue  yellow
7        7   green  yellow   green
8        6   green  yellow   green
9        0   green  yellow   green

From the data above, I use the following function to find the unique values in "group_1" and "group_2":

fun_names <- function(x) {
  row1 <- unique(x$group_1)
  row2 <- unique(x$group_2)
  mat <- data.frame(matrix(nrow = length(row1) + length(row2), ncol = 1))
  mat[1] <- c(row1, row2)
  mat_unique <- data.frame(mat[!duplicated(mat[,1]), ])
  names(mat_unique) <- c("ID")

  return(mat_unique)
}
df_unique <- fun_names(df)

This returns the following data frame:

      ID
1   blue
2 yellow
3  green
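For comparison, here is a minimal sketch of the same lookup without the intermediate data frames. It assumes the group columns are plain character (or numeric) vectors rather than factors; the name `ids` is only illustrative and does not appear in the original code.

```r
# Sample data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3),
  stringsAsFactors = FALSE
)

# Unique values across both group columns, in order of first appearance
ids <- unique(c(df$group_1, df$group_2))
ids
```

On the sample data this yields the same three values, in the same order, as the ID column above.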

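Then, for each color ("ID"), I create a new column that is 1 if the color appears in the row (in group_1 or group_2) and does not match the value in the "exclude" column, and 0 otherwise. The loop looks like this: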

for(name in df_unique$ID) {
  df[paste(name)] <- 
    ifelse(df$group_1 == name & df$exclude != name | 
           df$group_2 == name & df$exclude != name, 1, 0)
}
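A note on how the condition in this loop is parsed: in R, `&` binds more tightly than `|`, so the expression reads as (A & C) | (B & C), which is the same as (A | B) & C. The sketch below checks that on the sample columns for one value; the variable names `as_written` and `factored` are only illustrative.

```r
# Sample columns from the question
group_1 <- rep(c("blue", "yellow", "green"), each = 3)
group_2 <- rep(c("green", "blue", "yellow"), each = 3)
exclude <- group_1
name <- "blue"

# The condition exactly as written in the loop
as_written <- group_1 == name & exclude != name |
              group_2 == name & exclude != name

# The same condition with the shared exclude test factored out
factored <- (group_1 == name | group_2 == name) & exclude != name

identical(as_written, factored)  # TRUE
as.integer(factored)             # 0 0 0 1 1 1 0 0 0
```

Converting the logical vector with `as.integer()` gives the same 0/1 values as the `ifelse(..., 1, 0)` call, apart from the storage type.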

Running this loop returns the final data.frame shown below.

Edit: this is the final df for the numeric data:

  response group_1 group_2 exclude 10001 10003 10002
1        5   10001   10002   10001     0     0     1
2        1   10001   10002   10001     0     0     1
3        2   10001   10002   10001     0     0     1
4        0   10003   10001   10003     1     0     0
5        4   10003   10001   10003     1     0     0
6        8   10003   10001   10003     1     0     0
7        7   10002   10003   10002     0     1     0
8        6   10002   10003   10002     0     1     0
9        0   10002   10003   10002     0     1     0

And this is the final df for the original (color) data:

  response group_1 group_2 exclude blue yellow green
1        5    blue   green    blue    0      0     1
2        1    blue   green    blue    0      0     1
3        2    blue   green    blue    0      0     1
4        0  yellow    blue  yellow    1      0     0
5        4  yellow    blue  yellow    1      0     0
6        8  yellow    blue  yellow    1      0     0
7        7   green  yellow   green    0      1     0
8        6   green  yellow   green    0      1     0
9        0   green  yellow   green    0      1     0

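So my question is: if the original data is a matrix (rather than a data frame), how do I perform this loop? Because the loop modifies a data frame, I have to convert that data frame to a matrix before I can convert it to a sparse matrix, and this data.frame to data.matrix conversion is too intensive for my machine.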

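I have converted everything in my code up to the for loop above into matrix notation, but I cannot figure out how to add new columns in this way when modifying a matrix (rather than a data frame) in R. Basically, I am hoping someone can help me modify the for loop so that it works on a matrix. Does anyone have any suggestions?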

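Edit: I forgot to mention that the source data needs to keep its grouping: group_by(response, group_1, group_2, exclude). Also, the df object needs to start out as a matrix, so that the data.frame to data.matrix conversion is removed.

EDIT2: I did not mention this, but before the whole process runs, all of the data is indexed and converted to numeric values. So the df object in the example is really all numeric.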
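To make the request concrete, here is a minimal sketch of the loop running directly on a numeric matrix. It assumes the data already starts as a matrix with the four named columns; `ids`, `dummies`, `hit`, and `result` are illustrative names. The dummy block is preallocated once and joined with `cbind()` at the end, since a matrix cannot gain a column through `m[name] <- ...` the way a data frame can.

```r
# Numeric sample data from the question, built directly as a matrix
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

# Unique ids across both group columns, in order of first appearance
ids <- unique(c(m[, "group_1"], m[, "group_2"]))

# Preallocate the dummy block instead of growing the matrix per iteration
dummies <- matrix(0, nrow = nrow(m), ncol = length(ids),
                  dimnames = list(NULL, as.character(ids)))

for (k in seq_along(ids)) {
  id <- ids[k]
  # Same condition as the data frame loop, applied to matrix columns
  hit <- (m[, "group_1"] == id | m[, "group_2"] == id) & m[, "exclude"] != id
  dummies[hit, k] <- 1
}

# Join the original columns and the dummy columns in one step
result <- cbind(m, dummies)
result
```

On the sample data the result has the same seven columns and values as the numeric final df shown earlier, but it is a plain numeric matrix throughout.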

3 answers:

Answer 0 (score: 1)

Is this too intensive for your matrix? It uses dplyr and tidyr to eliminate the for loop entirely:

library(dplyr)
library(tidyr)

m = df %>% 
    mutate(group = ifelse(group_1 == exclude, group_2, group_1), ones = 1) %>%
    select(response, group, ones) %>%
    spread(key = group, value = ones, fill = 0) %>%
    as.matrix
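Two caveats seem worth checking before relying on this. The `ifelse()` rule assumes that `exclude` always equals either group_1 or group_2, which holds in the sample but is not stated in the question. Also, `spread()` treats `response` as the only identifying column, so rows that share a response value (the two zeros in the sample) appear to end up on a single output row. The base R sketch below applies the same grouping rule while keeping one output row per input row; `group`, `ids`, and `dummies` are illustrative names.

```r
# Sample data from the question, rebuilt so the snippet runs on its own
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3),
  stringsAsFactors = FALSE
)

# Same rule as the answer: take whichever group is not the excluded one
group <- ifelse(df$group_1 == df$exclude, df$group_2, df$group_1)
ids <- unique(c(df$group_1, df$group_2))

# Compare every row's group with every id: one 0/1 column per id
dummies <- outer(group, ids, "==") * 1
colnames(dummies) <- ids

m <- cbind(response = df$response, dummies)
m
```

The result is a 9 x 4 numeric matrix whose dummy columns match the final df in the question.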

Answer 1 (score: 1)

So I start with a matrix like this:

m <- matrix(nrow = 9, ncol = 4)
m[,1]<- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
m[,2] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green") 
m[,3] <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow") 
m[,4] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
colnames(m) <- c("response", "group_1", "group_2", "exclude")

>m
 #    response group_1  group_2  exclude 
 #[1,] "5"      "blue"   "green"  "blue"  
 #[2,] "1"      "blue"   "green"  "blue"  
 #[3,] "2"      "blue"   "green"  "blue"  
 #[4,] "0"      "yellow" "blue"   "yellow"
 #[5,] "4"      "yellow" "blue"   "yellow"
 #[6,] "8"      "yellow" "blue"   "yellow"
 #[7,] "7"      "green"  "yellow" "green" 
 #[8,] "6"      "green"  "yellow" "green" 
 #[9,] "0"      "green"  "yellow" "green"

Using the dummy() function from the dummies package:

library(dummies)
one_hot_encoded_vars = dummy(x = "group_2", data = m)
>one_hot_encoded_vars
 #        group_2blue group_2green group_2yellow
 #[1,]           0            1             0
 #[2,]           0            1             0
 #[3,]           0            1             0
 #[4,]           1            0             0
 #[5,]           1            0             0
 #[6,]           1            0             0
 #[7,]           0            0             1
 #[8,]           0            0             1
 #[9,]           0            0             1

Create a numeric matrix containing all of the variables:

finalmatrix = cbind(as.numeric(m[,'response']),dummy(x = 'group_1',data = m),
    dummy(x = 'group_2',data = m),dummy(x = 'exclude',data=m))

>finalmatrix
#             group_1blue group_1green group_1yellow group_2blue group_2green group_2yellow excludeblue excludegreen
 #[1,] 5           1            0             0           0            1             0           1            0
 #[2,] 1           1            0             0           0            1             0           1            0
 #[3,] 2           1            0             0           0            1             0           1            0
 #[4,] 0           0            0             1           1            0             0           0            0
 #[5,] 4           0            0             1           1            0             0           0            0
 #[6,] 8           0            0             1           1            0             0           0            0
 #[7,] 7           0            1             0           0            0             1           0            1
 #[8,] 6           0            1             0           0            0             1           0            1
 #[9,] 0           0            1             0           0            0             1           0            1
 #         excludeyellow
 #[1,]             0
 #[2,]             0
 #[3,]             0
 #[4,]             1
 #[5,]             1
 #[6,]             1
 #[7,]             0
 #[8,]             0
 #[9,]             0

If you want to keep the group information, you can do:

 finalmatrix = cbind(m, finalmatrix)

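But finalmatrix will then be a character matrix, because binding the character columns of m coerces every value to character.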
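One way to keep the group columns without ending up with a character matrix is to replace each label by an integer code, which is in line with the question's second edit about indexing everything to numeric values. This is a sketch using base R `match()`; `ids` and `coded` are illustrative names and the coding (position of first appearance) is an assumption, not something the question specifies.

```r
# Character matrix as in this answer
m <- cbind(
  response = as.character(c(5, 1, 2, 0, 4, 8, 7, 6, 0)),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3)
)

# Lookup table of labels: blue = 1, yellow = 2, green = 3
ids <- unique(c(m[, "group_1"], m[, "group_2"]))

# Replace each label by its position in ids, so every column is numeric
coded <- cbind(
  response = as.numeric(m[, "response"]),
  group_1  = match(m[, "group_1"], ids),
  group_2  = match(m[, "group_2"], ids),
  exclude  = match(m[, "exclude"], ids)
)
is.numeric(coded)  # TRUE
```

The original labels can be recovered at any time with `ids[coded[, "group_1"]]`.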

Answer 2 (score: 1)

Dummy encoding with a sparse matrix:

m <- as.matrix(df)

groups <- unique(as.vector(m[, grep("group", colnames(m))]))
tmp <- lapply(groups, function(x, m) 
  which((m[, "group_1"] == x | m[, "group_2"] == x) & m[, "exclude"] != x),
       m = m)

j = rep(seq_along(tmp), lengths(tmp))
i = unlist(tmp)

library(Matrix)
dummies <- sparseMatrix(i, j, dims = c(nrow(m), length(groups)))
colnames(dummies) <- groups

M <- Matrix(as.matrix(df))
cbind(M, dummies)
#9 x 7 Matrix of class "dgeMatrix"
#     response group_1 group_2 exclude 10001 10003 10002
#[1,]        5   10001   10002   10001     0     0     1
#[2,]        1   10001   10002   10001     0     0     1
#[3,]        2   10001   10002   10001     0     0     1
#[4,]        0   10003   10001   10003     1     0     0
#[5,]        4   10003   10001   10003     1     0     0
#[6,]        8   10003   10001   10003     1     0     0
#[7,]        7   10002   10003   10002     0     1     0
#[8,]        6   10002   10003   10002     0     1     0
#[9,]        0   10002   10003   10002     0     1     0
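Note that the printed result is of class "dgeMatrix", which is a dense format. If the goal is to stay sparse from start to finish, a variation along the following lines might help. It is a sketch that assumes the data is already a numeric matrix; it builds the positions of the 1s with `match()` instead of looping over groups, and `keep_1`, `keep_2`, and `X` are illustrative names.

```r
library(Matrix)

# Numeric sample data from the question, as a plain matrix
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)
ids <- unique(c(m[, "group_1"], m[, "group_2"]))

# A group value earns a 1 unless it equals exclude; the second test avoids
# counting the same cell twice when group_1 and group_2 coincide
keep_1 <- m[, "group_1"] != m[, "exclude"]
keep_2 <- m[, "group_2"] != m[, "exclude"] & m[, "group_2"] != m[, "group_1"]

i <- c(which(keep_1), which(keep_2))
j <- c(match(m[keep_1, "group_1"], ids), match(m[keep_2, "group_2"], ids))

dummies <- sparseMatrix(i = i, j = j, x = 1,
                        dims = c(nrow(m), length(ids)),
                        dimnames = list(NULL, as.character(ids)))

# Convert the original columns to sparse storage before binding
X <- cbind(Matrix(m, sparse = TRUE), dummies)
class(X)
```

The group columns contain no zeros, so storing them sparsely saves nothing by itself; the benefit is that the combined object stays in a sparse class and the dummy block, which is mostly zeros, is never materialized densely.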