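I am running a fairly complex linear regression that involves conditionally creating dummy variables in new columns with a for loop. So far I have been doing this across several data frames, converting them to matrices, then to sparse matrices, and then joining them; however, I have hit the limits of my computer. Apologies if this is confusing; I have simplified the process as much as I can.
Edit: added an all-numeric example to the original question.
Here is the source data with all-numeric values: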
df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
df$X3 <- c(10002, 10002, 10002, 10001, 10001, 10001, 10003, 10003, 10003)
df$X4 <- c(10001, 10001, 10001, 10003, 10003, 10003, 10002, 10002, 10002)
names(df) <- c("response", "group_1", "group_2", "exclude")
What it looks like:
response group_1 group_2 exclude
1 5 10001 10002 10001
2 1 10001 10002 10001
3 2 10001 10002 10001
4 0 10003 10001 10003
5 4 10003 10001 10003
6 8 10003 10001 10003
7 7 10002 10003 10002
8 6 10002 10003 10002
9 0 10002 10003 10002
Source data with color labels (see the edit above for the numeric version):
df <- data.frame(matrix(nrow = 9, ncol = 4))
df$X1 <- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
df$X2 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
df$X3 <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow")
df$X4 <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
names(df) <- c("response", "group_1", "group_2", "exclude")
Here is a simplified version of the data:
response group_1 group_2 exclude
1 5 blue green blue
2 1 blue green blue
3 2 blue green blue
4 0 yellow blue yellow
5 4 yellow blue yellow
6 8 yellow blue yellow
7 7 green yellow green
8 6 green yellow green
9 0 green yellow green
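From the data above, I use the following function to find the unique values in "group_1" and "group_2":
fun_names <- function(x) {
  # Unique values from each group column
  row1 <- unique(x$group_1)
  row2 <- unique(x$group_2)
  # Stack them in one column, then drop values that appear in both
  mat <- data.frame(matrix(nrow = length(row1) + length(row2), ncol = 1))
  mat[1] <- c(row1, row2)
  mat_unique <- data.frame(mat[!duplicated(mat[,1]), ])
  names(mat_unique) <- c("ID")
  return(mat_unique)
}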
df_unique <- fun_names(df)
This returns the following data frame:
ID
1 blue
2 yellow
3 green
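For what it's worth, the same list of IDs can probably be obtained without the intermediate data frame. A minimal sketch, assuming only the two group columns matter and that first-occurrence order is what you want (the name ids is mine, not from the code above):

```r
# Same example data as above (color version).
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3),
  stringsAsFactors = FALSE
)

# unique() keeps first-occurrence order, so values seen in group_1 come first.
ids <- unique(c(df$group_1, df$group_2))
print(ids)
```

Because this is a plain vector, it can be looped over directly (for (name in ids) ...) instead of going through df_unique$ID.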
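Then, for each color ("ID"), I create a new column whose value is 1 if the color appears in either group column of that row and does not match the value in the "exclude" column, and 0 otherwise. The loop looks like this:
for(name in df_unique$ID) {
  # 1 if the ID is in group_1 or group_2 and is not the excluded value
  df[paste(name)] <-
    ifelse(df$group_1 == name & df$exclude != name |
           df$group_2 == name & df$exclude != name, 1, 0)
}
Running this loop returns the final data.frame, shown below.
Edit: here is the final df for the numeric data: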
response group_1 group_2 exclude 10001 10003 10002
1 5 10001 10002 10001 0 0 1
2 1 10001 10002 10001 0 0 1
3 2 10001 10002 10001 0 0 1
4 0 10003 10001 10003 1 0 0
5 4 10003 10001 10003 1 0 0
6 8 10003 10001 10003 1 0 0
7 7 10002 10003 10002 0 1 0
8 6 10002 10003 10002 0 1 0
9 0 10002 10003 10002 0 1 0
And here is the final df for the original (color) data:
response group_1 group_2 exclude blue yellow green
1 5 blue green blue 0 0 1
2 1 blue green blue 0 0 1
3 2 blue green blue 0 0 1
4 0 yellow blue yellow 1 0 0
5 4 yellow blue yellow 1 0 0
6 8 yellow blue yellow 1 0 0
7 7 green yellow green 0 1 0
8 6 green yellow green 0 1 0
9 0 green yellow green 0 1 0
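As an aside, the loop above can be written so that it produces a matrix of dummy columns in one step rather than adding columns to df one at a time. This is only a sketch on the color example; ids and dummies are names I made up, and it assumes the same rule as the loop (ID present in group_1 or group_2, and different from exclude):

```r
df <- data.frame(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c("blue", "yellow", "green"), each = 3),
  group_2  = rep(c("green", "blue", "yellow"), each = 3),
  exclude  = rep(c("blue", "yellow", "green"), each = 3),
  stringsAsFactors = FALSE
)

ids <- unique(c(df$group_1, df$group_2))

# sapply() returns one column per ID; because ids is a character vector,
# the columns are named after the IDs automatically.
dummies <- sapply(ids, function(id) {
  as.numeric((df$group_1 == id | df$group_2 == id) & df$exclude != id)
})

print(dummies)
```

The result is a plain numeric matrix with 9 rows and one column per ID, so it never has to pass through a data.frame to data.matrix conversion.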
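So, my question is: how can I run this loop if the original data is a matrix (rather than a data frame)? Because the loop modifies a data frame, I need to convert that data frame to a matrix so that it can be converted to a sparse matrix, and on my machine this data.frame to data.matrix conversion is too intensive.
I have converted everything in my code up to the for loop above into matrix notation, but I can't figure out how to add new columns in this way when modifying a matrix (rather than a data frame) in R. Basically, I am hoping someone can help me modify the for loop so that it works on a matrix. Does anyone have any suggestions?
Edit
I forgot to mention that the source data needs to keep its grouping: group_by(response, group_1, group_2, exclude). In addition, the df object needs to start out in matrix form, to eliminate the data.frame to data.matrix conversion.
EDIT2
I didn't mention this, but before the whole process runs, all of the data is indexed and converted to numeric values. So the df object in the example is really all numbers.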
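To make the target concrete, here is roughly what I imagine the matrix version of the loop would look like on the numeric example. It is a sketch only: m, ids, dummies and result are placeholder names, and it assumes the data already sits in a numeric matrix with the four column names used above.

```r
# Numeric example data held in a matrix from the start (no data.frame step).
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

ids <- unique(c(m[, "group_1"], m[, "group_2"]))

# Preallocate every dummy column up front; calling cbind() inside the loop
# would copy the whole matrix on each iteration.
dummies <- matrix(0, nrow = nrow(m), ncol = length(ids),
                  dimnames = list(NULL, as.character(ids)))

for (k in seq_along(ids)) {
  id <- ids[k]
  hit <- (m[, "group_1"] == id | m[, "group_2"] == id) & m[, "exclude"] != id
  dummies[hit, k] <- 1
}

# A single cbind() at the end attaches the new columns.
result <- cbind(m, dummies)
print(result)
```

A matrix cannot be extended with m["new"] <- ... the way a data frame can, which is why the new columns are filled in a separate preallocated matrix and bound on once.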
Answer 0: (score: 1)
Is this too intensive for your matrix? It uses dplyr and tidyr to eliminate the for loop entirely:
library(dplyr)
library(tidyr)
m = df %>%
  # keep the group that is not excluded (assumes exclude matches group_1 or group_2)
  mutate(group = ifelse(group_1 == exclude, group_2, group_1), ones = 1) %>%
  select(response, group, ones) %>%
  spread(key = group, value = ones, fill = 0) %>%
  as.matrix
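One thing to watch with the pipeline above: as far as I can tell, spread() treats response as the row identifier, so the two rows with response 0 (rows 4 and 9) would end up merged into one. If that matters, the same idea can be sketched in base R directly on a numeric matrix, which keeps one output row per input row. The names m, group, ids and dummies are placeholders of mine, and the shortcut still assumes exclude matches one of the two groups in every row:

```r
m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)
n <- nrow(m)

# Same shortcut as the mutate() step: take group_2 when group_1 is excluded.
group <- ifelse(m[, "group_1"] == m[, "exclude"], m[, "group_2"], m[, "group_1"])

ids <- unique(c(m[, "group_1"], m[, "group_2"]))
dummies <- matrix(0, nrow = n, ncol = length(ids),
                  dimnames = list(NULL, as.character(ids)))

# A two-column (row, column) index sets exactly one cell per row.
dummies[cbind(seq_len(n), match(group, ids))] <- 1

result <- cbind(response = m[, "response"], dummies)
print(result)
```

Unlike the original for loop, this marks only one group per row, so a row where exclude matches neither group would get a single 1 instead of two.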
Answer 1: (score: 1)
So I start with a matrix like this:
m <- matrix(nrow = 9, ncol = 4)
m[,1]<- c(5, 1, 2, 0, 4, 8, 7, 6, 0)
m[,2] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
m[,3] <- c("green", "green", "green", "blue", "blue", "blue", "yellow", "yellow", "yellow")
m[,4] <- c("blue", "blue", "blue", "yellow", "yellow", "yellow", "green", "green", "green")
colnames(m) <- c("response", "group_1", "group_2", "exclude")
>m
# response group_1 group_2 exclude
#[1,] "5" "blue" "green" "blue"
#[2,] "1" "blue" "green" "blue"
#[3,] "2" "blue" "green" "blue"
#[4,] "0" "yellow" "blue" "yellow"
#[5,] "4" "yellow" "blue" "yellow"
#[6,] "8" "yellow" "blue" "yellow"
#[7,] "7" "green" "yellow" "green"
#[8,] "6" "green" "yellow" "green"
#[9,] "0" "green" "yellow" "green"
Using the dummy() function from the dummies package:
library(dummies)
one_hot_encoded_vars = dummy(x = "group_2", data = m)
>one_hot_encoded_vars
# group_2blue group_2green group_2yellow
#[1,] 0 1 0
#[2,] 0 1 0
#[3,] 0 1 0
#[4,] 1 0 0
#[5,] 1 0 0
#[6,] 1 0 0
#[7,] 0 0 1
#[8,] 0 0 1
#[9,] 0 0 1
Create a numeric matrix containing all of the variables:
finalmatrix = cbind(as.numeric(m[,'response']),dummy(x = 'group_1',data = m),
dummy(x = 'group_2',data = m),dummy(x = 'exclude',data=m))
>finalmatrix
# group_1blue group_1green group_1yellow group_2blue group_2green group_2yellow excludeblue excludegreen
#[1,] 5 1 0 0 0 1 0 1 0
#[2,] 1 1 0 0 0 1 0 1 0
#[3,] 2 1 0 0 0 1 0 1 0
#[4,] 0 0 0 1 1 0 0 0 0
#[5,] 4 0 0 1 1 0 0 0 0
#[6,] 8 0 0 1 1 0 0 0 0
#[7,] 7 0 1 0 0 0 1 0 1
#[8,] 6 0 1 0 0 0 1 0 1
#[9,] 0 0 1 0 0 0 1 0 1
# excludeyellow
#[1,] 0
#[2,] 0
#[3,] 0
#[4,] 1
#[5,] 1
#[6,] 1
#[7,] 0
#[8,] 0
#[9,] 0
If you want to keep the group information, you can do:
finalmatrix = cbind(m, finalmatrix)
However, finalmatrix will then be a character-type object.
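If installing dummies is not an option, one-hot columns of the same shape can be sketched in base R. This is an illustration rather than a drop-in replacement: lev and onehot are names I chose, and the column naming simply imitates the group_2blue style shown above.

```r
group_2 <- rep(c("green", "blue", "yellow"), each = 3)

# Sorted levels give the same blue / green / yellow column order as above.
lev <- sort(unique(group_2))

# outer() compares every value with every level; adding 0L turns TRUE/FALSE into 1/0.
onehot <- outer(group_2, lev, "==") + 0L
colnames(onehot) <- paste0("group_2", lev)

print(onehot)
```

Since onehot is an integer matrix, binding it to other numeric columns with cbind() keeps the result numeric, as long as no character column is included.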
Answer 2: (score: 1)
Dummy encoding with a sparse matrix:
m <- as.matrix(df)
groups <- unique(as.vector(m[, grep("group", colnames(m))]))
tmp <- lapply(groups, function(x, m)
  which((m[, "group_1"] == x | m[, "group_2"] == x) & m[, "exclude"] != x),
  m = m)
j = rep(seq_along(tmp), lengths(tmp))
i = unlist(tmp)
library(Matrix)
dummies <- sparseMatrix(i, j, dims = c(nrow(m), length(groups)))
colnames(dummies) <- groups
M <- Matrix(as.matrix(df))
cbind(M, dummies)
#9 x 7 Matrix of class "dgeMatrix"
# response group_1 group_2 exclude 10001 10003 10002
#[1,] 5 10001 10002 10001 0 0 1
#[2,] 1 10001 10002 10001 0 0 1
#[3,] 2 10001 10002 10001 0 0 1
#[4,] 0 10003 10001 10003 1 0 0
#[5,] 4 10003 10001 10003 1 0 0
#[6,] 8 10003 10001 10003 1 0 0
#[7,] 7 10002 10003 10002 0 1 0
#[8,] 6 10002 10003 10002 0 1 0
#[9,] 0 10002 10003 10002 0 1 0
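Since the question asks for the data to start out as a matrix, here is a sketch of the same idea that never touches a data frame and builds the (i, j) index without lapply. The names m, ids, keep1, keep2 and result are placeholders, and it assumes the numeric matrix layout from the question:

```r
library(Matrix)

m <- cbind(
  response = c(5, 1, 2, 0, 4, 8, 7, 6, 0),
  group_1  = rep(c(10001, 10003, 10002), each = 3),
  group_2  = rep(c(10002, 10001, 10003), each = 3),
  exclude  = rep(c(10001, 10003, 10002), each = 3)
)

ids  <- unique(c(m[, "group_1"], m[, "group_2"]))
rows <- seq_len(nrow(m))

# Column position of each row's group value within ids.
j1 <- match(m[, "group_1"], ids)
j2 <- match(m[, "group_2"], ids)

# Keep a group only if it is not the excluded value; skip group_2 when it
# repeats group_1, because sparseMatrix() adds up duplicated (i, j) pairs.
keep1 <- m[, "group_1"] != m[, "exclude"]
keep2 <- m[, "group_2"] != m[, "exclude"] & m[, "group_2"] != m[, "group_1"]

dummies <- sparseMatrix(
  i = c(rows[keep1], rows[keep2]),
  j = c(j1[keep1], j2[keep2]),
  x = 1,
  dims = c(nrow(m), length(ids)),
  dimnames = list(NULL, as.character(ids))
)

# Store the original columns sparsely too, then bind everything together.
result <- cbind(Matrix(m, sparse = TRUE), dummies)
print(result)
```

Passing x = 1 makes dummies a numeric sparse matrix rather than a pattern matrix, which may be more convenient if it goes straight into a regression.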