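I want to convert a discrete (identifier) variable into a series of logical columns, so that I can use it as a feature in a logistic regression function (and other functions) where continuous and discrete values can be mixed.
I have a factor column in a data frame, and I want to convert it into a matrix of logical values with one column per level (1 ... "number of levels"), for example: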
my_labels=c("a","b","c","d","e","f")
my_tally=c(1,1,3,2,3,4,5,1)
my_tally=factor(my_tally, levels=c(1:6), labels=my_labels)
summary(my_tally)
expected_output=c(1,0,0,0,0,0, #1
1,0,0,0,0,0, #1
0,0,1,0,0,0, #3
0,1,0,0,0,0, #2
0,0,1,0,0,0, #3
0,0,0,1,0,0, #4
0,0,0,0,1,0, #5
1,0,0,0,0,0 #1
)
expected_output=matrix(expected_output,
nrow=length(my_tally),
ncol=length(levels(my_tally)),
byrow=TRUE
)
expected_output
colSums(expected_output)
Any suggestions for a "fast" function that produces expected_output? This is a big-data problem (700 discrete possibilities, 1M observations).
Answer 0: (score: 4)
Here are two solutions: one using base R, which will be faster on smaller data sets, and one using a sparse matrix from the Matrix package, which will be very fast on larger data sets.
Base R solution. Create a matrix filled with only 0's:
mat <- matrix(0, nrow=length(my_tally), ncol=length(levels(my_tally)))
Use indices to assign 1's where appropriate:
mat[cbind(1:length(my_tally), as.numeric(my_tally))] <- 1
#     [,1] [,2] [,3] [,4] [,5] [,6]
#[1,]    1    0    0    0    0    0
#[2,]    1    0    0    0    0    0
#[3,]    0    0    1    0    0    0
#[4,]    0    1    0    0    0    0
#[5,]    0    0    1    0    0    0
#[6,]    0    0    0    1    0    0
#[7,]    0    0    0    0    1    0
#[8,]    1    0    0    0    0    0
colSums(mat)
#[1] 3 1 2 1 1 0
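The index-matrix assignment above can be wrapped in a small reusable function. This is only a sketch: `one_hot` is a hypothetical helper name that does not appear in the original answer, and it assumes the factor has no missing values. It uses `seq_along()` instead of `1:length()` so that an empty factor does not produce the `1:0` sequence, and it keeps the level labels as column names.

```r
# Hypothetical helper (not from the original answer): one-hot encode a factor.
# Assumes `f` contains no NA values.
one_hot <- function(f) {
  stopifnot(is.factor(f))
  mat <- matrix(0L, nrow = length(f), ncol = nlevels(f),
                dimnames = list(NULL, levels(f)))
  # Each row of the two-column index matrix is a (row, column) cell to set to 1.
  mat[cbind(seq_along(f), as.integer(f))] <- 1L
  mat
}

# Same small example as in the question.
my_labels <- c("a", "b", "c", "d", "e", "f")
my_tally  <- factor(c(1, 1, 3, 2, 3, 4, 5, 1), levels = 1:6, labels = my_labels)

encoded <- one_hot(my_tally)
encoded
colSums(encoded)
```

Storing integers rather than doubles halves the memory of the dense matrix, which may matter at the sizes mentioned in the question, although a dense 1M x 700 matrix is still very large.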
Sparse matrix solution, using the Matrix package:
library(Matrix)
colSums(sparseMatrix(i=1:length(my_tally), j=as.numeric(my_tally),
                     dims=c(length(my_tally), length(levels(my_tally)))))
#[1] 3 1 2 1 1 0
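The call above only reports the column sums. If the indicator matrix itself is needed as a model feature, a sketch along the following lines keeps the sparse object. The variable name `sp` is an assumption for illustration; passing `x = 1` stores numeric ones instead of a pattern-only matrix, and `dimnames` carries the level labels over as column names.

```r
library(Matrix)

# Same small example as in the question.
my_labels <- c("a", "b", "c", "d", "e", "f")
my_tally  <- factor(c(1, 1, 3, 2, 3, 4, 5, 1), levels = 1:6, labels = my_labels)

# One non-zero entry per observation: row = observation, column = level code.
sp <- sparseMatrix(i = seq_along(my_tally),
                   j = as.integer(my_tally),
                   x = 1,
                   dims = c(length(my_tally), nlevels(my_tally)),
                   dimnames = list(NULL, levels(my_tally)))
sp
colSums(sp)
```

Only the non-zero cells are stored, so memory grows with the number of observations rather than with observations times levels.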
Here are some benchmarks on a larger sample data set (260 levels, 100,000 elements), where you can really see the benefit of using a sparse matrix:
# Sample data
my_labels <- c(LETTERS, letters, paste0(LETTERS, letters), paste0(letters, LETTERS),
paste0(letters, letters, letters), paste0(LETTERS, LETTERS, LETTERS),
paste0(LETTERS, letters, LETTERS), paste0(letters, LETTERS, letters),
paste0(LETTERS, letters, letters), paste0(letters, LETTERS, LETTERS))
my_tally <- sample(1:260, 100000, replace=TRUE)
my_tally <- factor(my_tally, levels=c(1:260), labels=my_labels)
# Benchmarks
library(microbenchmark)
microbenchmark(
  Robert = colSums(table(1:length(my_tally), my_tally)),
  Frank1 = {mat <- matrix(0, nrow=length(my_tally), ncol=length(levels(my_tally)))
            mat[cbind(1:length(my_tally), as.numeric(my_tally))] <- 1
            colSums(mat)},
  Frank2 = colSums(sparseMatrix(i=1:length(my_tally), j=as.numeric(my_tally),
                                dims=c(length(my_tally), length(levels(my_tally))))),
  Khashaa = colSums(diag(length(my_labels))[my_tally, ])
)
expr            lq       mean     median         uq      max neval cld
Robert  444.625026 486.130804 461.653480 548.755603 632.1418   100   d
Frank1  328.947431 358.538855 337.136012 360.727606 458.2305   100   c
Frank2    4.241506   8.997434   4.354615   4.519896 135.3001   100   a
Khashaa 224.675094 256.337639 237.905714 260.163725 375.5642   100   b
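Timings are only meaningful if the four approaches produce the same indicator matrix. The following sketch checks that on the small example from the question. The helper `as_plain` and the `m_*` variable names are assumptions introduced here, not part of the original benchmark.

```r
library(Matrix)

my_labels <- c("a", "b", "c", "d", "e", "f")
my_tally  <- factor(c(1, 1, 3, 2, 3, 4, 5, 1), levels = 1:6, labels = my_labels)
n <- length(my_tally)
k <- nlevels(my_tally)

# The four approaches from the benchmark, keeping the full matrix each time.
m_table  <- table(seq_len(n), my_tally)
m_index  <- matrix(0, nrow = n, ncol = k)
m_index[cbind(seq_len(n), as.integer(my_tally))] <- 1
m_sparse <- sparseMatrix(i = seq_len(n), j = as.integer(my_tally), dims = c(n, k))
m_diag   <- diag(k)[my_tally, ]

# Hypothetical helper: drop classes and dimnames so only the values are compared.
as_plain <- function(m) matrix(as.numeric(as.matrix(m)), nrow = nrow(m))

all_same <- all(vapply(list(m_index, m_sparse, m_diag),
                       function(m) identical(as_plain(m), as_plain(m_table)),
                       logical(1)))
all_same
```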
Answer 1: (score: 2)
Try this:
expected_output<-table(1:length(my_tally),my_tally)
expected_output
colSums(expected_output)
a b c d e f
3 1 2 1 1 0
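Note that `table()` returns an object of class "table" rather than an ordinary matrix. If a downstream function expects a plain matrix, a conversion along these lines may help. This is a sketch; the name `mat` is an assumption for illustration.

```r
my_labels <- c("a", "b", "c", "d", "e", "f")
my_tally  <- factor(c(1, 1, 3, 2, 3, 4, 5, 1), levels = 1:6, labels = my_labels)

# Cross-tabulate observation index against level: one 1 per row.
tab <- table(seq_along(my_tally), my_tally)

# Rebuild as an ordinary integer matrix, keeping the level labels as column names.
mat <- matrix(as.integer(tab), nrow = nrow(tab),
              dimnames = list(NULL, colnames(tab)))
mat
colSums(mat)
```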
Answer 2: (score: 0)
Here is a relatively simple solution using an apply function:
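# Build the indicator row for observation number `entry`
updateOutput <- function(entry, classInput = my_tally){
  column <- as.numeric(classInput[entry])    # integer code of the level
  row <- rep(0, length(levels(classInput)))  # all-zero row, one cell per level
  row[column] <- 1
  row
}
# apply() returns one column per observation, so transpose the result
expected_output <- t(apply(matrix(1:length(my_tally)), 1, updateOutput))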
expected_output
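A variant of the same row-by-row idea can be written with `vapply()`, which checks that every row has the expected length and avoids building the one-column index matrix. This is a sketch under the assumption that the factor has no missing values; `one_hot_row` is a hypothetical name. It still loops at the R level, so it is unlikely to be the fastest option for data of the size described in the question.

```r
my_labels <- c("a", "b", "c", "d", "e", "f")
my_tally  <- factor(c(1, 1, 3, 2, 3, 4, 5, 1), levels = 1:6, labels = my_labels)

# Hypothetical helper: indicator row for a single level code.
one_hot_row <- function(level_index, n_levels) {
  row <- numeric(n_levels)
  row[level_index] <- 1
  row
}

k <- nlevels(my_tally)
# vapply() returns a k x n matrix (one column per observation), so transpose it.
expected_output <- t(vapply(as.integer(my_tally), one_hot_row,
                            numeric(k), n_levels = k))
expected_output
colSums(expected_output)
```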