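I'm working on a dataset with more than 230 variables, of which about 60 are categorical variables with more than 6 levels (with no way to order them by preference, e.g. Color).
My question is whether there is any function that can help me recode these variables without doing it by hand, which would take a lot of work and time and carry a high risk of errors!
I can use R and Python, so feel free to suggest the most efficient function for this task.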
Say I have a dataset named df, and the set of factor columns is:
clm=(clm1, clm2,clm3,....,clm60)
All of them are factors with many levels:
(min=2, max=not important [may be 10, 30 or 100...etc])
Thanks a lot for your help!
Answer 0 (score: 3)
Here is a short example using model.matrix to get you started:
df <- data.frame(
  clm1 = gl(2, 6, 12, c("clm1.levelA", "clm1.levelB")),
  clm2 = gl(3, 4, 12, c("clm2.levelA", "clm2.levelB", "clm2.levelC")))
#          clm1        clm2
#1  clm1.levelA clm2.levelA
#2  clm1.levelA clm2.levelA
#3  clm1.levelA clm2.levelA
#4  clm1.levelA clm2.levelA
#5  clm1.levelA clm2.levelB
#6  clm1.levelA clm2.levelB
#7  clm1.levelB clm2.levelB
#8  clm1.levelB clm2.levelB
#9  clm1.levelB clm2.levelC
#10 clm1.levelB clm2.levelC
#11 clm1.levelB clm2.levelC
#12 clm1.levelB clm2.levelC
# A one-sided formula is enough; "0 +" removes the intercept column
as.data.frame(model.matrix(~ 0 + clm1 + clm2, data = df))
#   clm1clm1.levelA clm1clm1.levelB clm2clm2.levelB clm2clm2.levelC
#1                1               0               0               0
#2                1               0               0               0
#3                1               0               0               0
#4                1               0               0               0
#5                1               0               1               0
#6                1               0               1               0
#7                0               1               1               0
#8                0               1               1               0
#9                0               1               0               1
#10               0               1               0               1
#11               0               1               0               1
#12               0               1               0               1
Answer 1 (score: 0)
With pandas in Python 3, you can do the following:
import pandas as pd
df = pd.DataFrame({'clm1': ['clm1a', 'clm1b', 'clm1c'], 'clm2': ['clm2a', 'clm2b', 'clm2c']})
pd.get_dummies(df)
For more examples, see the documentation.
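Since the question involves about 60 categorical columns among 230, here is a minimal sketch of how the categorical columns could be picked up automatically instead of being listed by hand. The toy frame and its column names (clm1, clm2, num) are made up for illustration, and it assumes the categorical columns are stored with pandas' category dtype (the closest equivalent of an R factor).

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "clm1": pd.Categorical(["a", "b", "a", "c"]),
    "clm2": pd.Categorical(["x", "y", "y", "x"]),
    "num": [1.5, 2.0, 0.3, 4.1],
})

# Collect every category-typed column rather than typing out 60 names.
cat_cols = df.select_dtypes(include=["category"]).columns.tolist()

# One 0/1 column per level; columns not listed in `columns` are left as they are.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
print(encoded.columns.tolist())
```

If the categorical columns are stored as plain strings instead, converting them first with astype("category") should make the same selection work.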
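Answer 2 (score: 0)
In R, the problem with the model.matrix approach proposed by @Maurits Evers is that the function drops the first level of every factor except the first one. Sometimes this is what you want, but sometimes it is not (depending on the problem, as @Maurits Evers emphasized).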
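To make the difference between keeping every level and dropping a reference level concrete, here is a small sketch in Python with pandas (used only for illustration; the data are made up). Note that the drop_first option of pd.get_dummies drops the first level of every encoded column, which is not quite the same as model.matrix without an intercept, where the first factor keeps all of its levels.

```python
import pandas as pd

# Hypothetical data with a three-level and a two-level categorical column.
df = pd.DataFrame({
    "color": pd.Categorical(["black", "red", "yellow", "red"]),
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
})

# Full encoding: one 0/1 column for every level of every column.
full = pd.get_dummies(df, dtype=int)

# Reference-level encoding: the first level of each column is dropped.
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The full encoding is usually what tree-based or distance-based methods expect, while the reduced one avoids perfectly collinear columns in a linear model with an intercept.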
There are several functions for this scattered across different packages (e.g. the caret package; see here for several examples).
I use the following function, inspired by a Stack Overflow answer from @Jaap:
#'
#' Transform factors from a data.frame into dummy variables (one hot encoding)
#'
#' This function will transform all factors into dummy variables with one column
#' for each level of the factor (unlike the contrasts matrices that will drop the first
#' level). The factors with only two levels will have only one column (0/1 on the second
#' level). The ordered factors and logicals are transformed into numeric.
#' The numeric and text vectors will remain untouched.
#'
make_dummies <- function(df) {
  # Helper: create dummy variables for a single column
  dummy <- function(fac) {
    if (is.factor(fac) && !is.ordered(fac)) {
      l <- levels(fac)
      # One 0/1 column per level
      res <- outer(fac, l, function(fac, l) 1L * (fac == l))
      colnames(res) <- l
      # Two-level factors keep only the column for the second level
      if (length(l) == 2) res <- res[, -1, drop = FALSE]
    } else if (is.ordered(fac) || is.logical(fac)) {
      # Ordered factors become their integer codes, logicals become 0/1
      res <- as.numeric(fac)
    } else {
      res <- fac
    }
    res
  }

  # Apply the helper to every column
  res <- lapply(df, dummy)

  # Prefix single-column results with the variable name (e.g. sex.Male)
  for (i in seq_along(res)) {
    if (is.matrix(res[[i]]) && ncol(res[[i]]) == 1) {
      colnames(res[[i]]) <- paste0(names(res)[i], ".", colnames(res[[i]]))
    }
  }

  as.data.frame(res)
}
Example:
df <- data.frame(num = round(rnorm(12), 1),
                 sex = factor(c("Male", "Female")),
                 color = factor(c("black", "red", "yellow")),
                 fac2 = factor(1:4),
                 fac3 = factor("A"),
                 size = factor(c("small", "middle", "big"),
                               levels = c("small", "middle", "big"), ordered = TRUE),
                 logi = c(TRUE, FALSE))
print(df)
#>     num    sex  color fac2 fac3   size  logi
#> 1   0.0   Male  black    1    A  small  TRUE
#> 2  -1.0 Female    red    2    A middle FALSE
#> 3   1.3   Male yellow    3    A    big  TRUE
#> 4   1.4 Female  black    4    A  small FALSE
#> 5  -0.9   Male    red    1    A middle  TRUE
#> 6   0.1 Female yellow    2    A    big FALSE
#> 7   1.4   Male  black    3    A  small  TRUE
#> 8   0.1 Female    red    4    A middle FALSE
#> 9   1.6   Male yellow    1    A    big  TRUE
#> 10  1.1 Female  black    2    A  small FALSE
#> 11  0.2   Male    red    3    A middle  TRUE
#> 12  0.3 Female yellow    4    A    big FALSE
make_dummies(df)
#>     num sex.Male color.black color.red color.yellow fac2.1 fac2.2 fac2.3
#> 1   0.0        1           1         0            0      1      0      0
#> 2  -1.0        0           0         1            0      0      1      0
#> 3   1.3        1           0         0            1      0      0      1
#> 4   1.4        0           1         0            0      0      0      0
#> 5  -0.9        1           0         1            0      1      0      0
#> 6   0.1        0           0         0            1      0      1      0
#> 7   1.4        1           1         0            0      0      0      1
#> 8   0.1        0           0         1            0      0      0      0
#> 9   1.6        1           0         0            1      1      0      0
#> 10  1.1        0           1         0            0      0      1      0
#> 11  0.2        1           0         1            0      0      0      1
#> 12  0.3        0           0         0            1      0      0      0
#>    fac2.4 fac3.A size logi
#> 1       0      1    1    1
#> 2       0      1    2    0
#> 3       0      1    3    1
#> 4       1      1    1    0
#> 5       0      1    2    1
#> 6       0      1    3    0
#> 7       0      1    1    1
#> 8       1      1    2    0
#> 9       0      1    3    1
#> 10      0      1    1    0
#> 11      0      1    2    1
#> 12      1      1    3    0
Created on 2018-03-19 by the reprex package (v0.2.0).