Recoding factor variables with many levels into dummy variables?

Date: 2018-03-19 10:25:01

Tags: python r pandas dataframe machine-learning

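I am working with a dataset of more than 230 variables, about 60 of which are categorical variables with more than 6 levels each (with no natural ordering, e.g. Color).

I am looking for a function that can recode these variables for me. Doing it by hand would take a lot of time and effort and carries a high risk of mistakes.

I can use either R or Python, so feel free to suggest whichever function handles this task most efficiently.

Say I have a dataset called df, and the set of factor columns is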

clm=(clm1, clm2,clm3,....,clm60)

All of these are factors with many levels:

(min=2, max=not important [may be 10, 30 or 100...etc])

Thank you very much for your help!

3 Answers:

Answer 0: (score: 3)

Here is a short example using model.matrix to get you started:

df <- data.frame(
    clm1 = gl(2, 6, 12, c("clm1.levelA", "clm1.levelB")),
    clm2 = gl(3, 4, 12, c("clm2.levelA", "clm2.levelB", "clm2.levelC")));
#          clm1        clm2
#1  clm1.levelA clm2.levelA
#2  clm1.levelA clm2.levelA
#3  clm1.levelA clm2.levelA
#4  clm1.levelA clm2.levelA
#5  clm1.levelA clm2.levelB
#6  clm1.levelA clm2.levelB
#7  clm1.levelB clm2.levelB
#8  clm1.levelB clm2.levelB
#9  clm1.levelB clm2.levelC
#10 clm1.levelB clm2.levelC
#11 clm1.levelB clm2.levelC
#12 clm1.levelB clm2.levelC



# One-sided formula; "0 +" removes the intercept, so clm1 keeps both of its levels
as.data.frame(model.matrix(~ 0 + clm1 + clm2, df))
#   clm1clm1.levelA clm1clm1.levelB clm2clm2.levelB clm2clm2.levelC
#1                1               0               0               0
#2                1               0               0               0
#3                1               0               0               0
#4                1               0               0               0
#5                1               0               1               0
#6                1               0               1               0
#7                0               1               1               0
#8                0               1               1               0
#9                0               1               0               1
#10               0               1               0               1
#11               0               1               0               1
#12               0               1               0               1

Answer 1: (score: 0)

With pandas in Python 3, you can do the following:

import pandas as pd
df = pd.DataFrame({'clm1': ['clm1a', 'clm1b', 'clm1c'], 'clm2': ['clm2a', 'clm2b', 'clm2c']})
pd.get_dummies(df)

For more examples, see the documentation.
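
For a wide table where only some of the columns are categorical, a minimal sketch might look like the following. The column names and values are made up for illustration. The `columns=` argument restricts the encoding to the listed columns and leaves everything else untouched, and selecting by dtype avoids typing 60 column names by hand (this assumes the columns already have the `category` dtype).

```python
import pandas as pd

# Hypothetical data: one numeric column and two categorical columns.
df = pd.DataFrame({
    "num": [1.5, 2.0, 3.5],
    "clm1": pd.Categorical(["a", "b", "a"]),
    "clm2": pd.Categorical(["x", "y", "z"]),
})

# Pick the categorical columns by dtype instead of listing them manually.
cat_cols = df.select_dtypes(include="category").columns.tolist()

# Encode only those columns; dtype=int yields 0/1 rather than booleans.
encoded = pd.get_dummies(df, columns=cat_cols, dtype=int)
print(encoded.columns.tolist())
```

The non-encoded columns come first in the result, followed by one indicator column per level, named `<column>_<level>`.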

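Answer 2: (score: 0)

In R, the issue with the model.matrix approach proposed by @Maurits Evers is that the function drops the first level of every factor except the first one. Sometimes that is what you want and sometimes it is not (it depends on the problem, as @Maurits Evers emphasized).

Several functions for this are scattered across different packages (for example the caret package; see here for a few examples).

I use the following function, inspired by a Stack Overflow answer from @Jaap: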
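
The same trade-off shows up in pandas, which may make it easier to see. In this sketch (toy data, not taken from the answers), the default keeps one column per level, while `drop_first=True` drops the first level. Note one difference from the model.matrix call above: pandas applies `drop_first` to every encoded column, including the first.

```python
import pandas as pd

# Toy column with three unordered levels.
color = pd.Series(["black", "red", "yellow", "black"], name="color")

# Full one-hot encoding: one indicator column per level.
full = pd.get_dummies(color, prefix="color", dtype=int)

# Reference-level encoding: the first level ("black") is dropped.
reduced = pd.get_dummies(color, prefix="color", drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With the reduced encoding, a row of all zeros stands for the dropped level, which is usually what a regression model with an intercept expects.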
#' 
#' Transform factors from a data.frame into dummy variables (one hot encoding)
#' 
#' This function will transform all factors into dummy variables with one column
#' for each level of the factor (unlike the contrasts matrices that will drop the first
#' level). The factors with only two levels will have only one column (0/1 on the second 
#' level). The ordered factors and logicals are transformed into numeric.
#' The numeric and text vectors will remain untouched.
#' 
make_dummies <- function(df){

    # function to create dummy variables for one factor only
    dummy <- function(fac, name = "") {

        if(is.factor(fac) && !is.ordered(fac)) {
            l <- levels(fac)
            # one 0/1 column per level
            res <- outer(fac, l, function(fac, l) 1L * (fac == l))
            colnames(res) <- paste0(name, l)
            # two levels: keep only the indicator for the second level
            if(length(l) == 2) {res <- res[, -1, drop = FALSE]}
        } else if(is.ordered(fac) || is.logical(fac)) {
            res <- as.numeric(fac)
        } else {
            res <- fac
        }
        return(res)
    }

    # Apply this function to all columns
    res <- (lapply(df, dummy))
    # change the names of the cases with only one column
    for(i in seq_along(res)){
        if(is.matrix(res[[i]]) && ncol(res[[i]]) == 1){
            colnames(res[[i]]) <- paste0(names(res)[i], ".", colnames(res[[i]]))
        }
    }
    res <- as.data.frame(res)
    return(res)
}

Example:

df <- data.frame(num = round(rnorm(12),1),
                 sex = factor(c("Male", "Female")),
                 color = factor(c("black", "red", "yellow")),
                 fac2 = factor(1:4),
                 fac3 = factor("A"),
                 size =  factor(c("small", "middle", "big"),
                                levels = c("small", "middle", "big"), ordered = TRUE),
                 logi = c(TRUE, FALSE))
print(df)
#>     num    sex  color fac2 fac3   size  logi
#> 1   0.0   Male  black    1    A  small  TRUE
#> 2  -1.0 Female    red    2    A middle FALSE
#> 3   1.3   Male yellow    3    A    big  TRUE
#> 4   1.4 Female  black    4    A  small FALSE
#> 5  -0.9   Male    red    1    A middle  TRUE
#> 6   0.1 Female yellow    2    A    big FALSE
#> 7   1.4   Male  black    3    A  small  TRUE
#> 8   0.1 Female    red    4    A middle FALSE
#> 9   1.6   Male yellow    1    A    big  TRUE
#> 10  1.1 Female  black    2    A  small FALSE
#> 11  0.2   Male    red    3    A middle  TRUE
#> 12  0.3 Female yellow    4    A    big FALSE
make_dummies(df)
#>     num sex.Male color.black color.red color.yellow fac2.1 fac2.2 fac2.3
#> 1   0.0        1           1         0            0      1      0      0
#> 2  -1.0        0           0         1            0      0      1      0
#> 3   1.3        1           0         0            1      0      0      1
#> 4   1.4        0           1         0            0      0      0      0
#> 5  -0.9        1           0         1            0      1      0      0
#> 6   0.1        0           0         0            1      0      1      0
#> 7   1.4        1           1         0            0      0      0      1
#> 8   0.1        0           0         1            0      0      0      0
#> 9   1.6        1           0         0            1      1      0      0
#> 10  1.1        0           1         0            0      0      1      0
#> 11  0.2        1           0         1            0      0      0      1
#> 12  0.3        0           0         0            1      0      0      0
#>    fac2.4 fac3.A size logi
#> 1       0      1    1    1
#> 2       0      1    2    0
#> 3       0      1    3    1
#> 4       1      1    1    0
#> 5       0      1    2    1
#> 6       0      1    3    0
#> 7       0      1    1    1
#> 8       1      1    2    0
#> 9       0      1    3    1
#> 10      0      1    1    0
#> 11      0      1    2    1
#> 12      1      1    3    0

Created on 2018-03-19 by the reprex package (v0.2.0).
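
For readers working in Python, a rough pandas analogue of the R function above could look like this. It is an illustrative sketch rather than a faithful port: it assumes the factor columns use the `category` dtype, and the sample data is invented. Missing values are not handled (an ordered column with NaN would get code 0).

```python
import pandas as pd


def make_dummies(df: pd.DataFrame) -> pd.DataFrame:
    """Encode unordered categoricals as 0/1 columns; leave other columns mostly as-is."""
    parts = []
    for name, col in df.items():
        is_cat = isinstance(col.dtype, pd.CategoricalDtype)
        if is_cat and not col.cat.ordered:
            # One indicator column per level, named "<column>.<level>".
            dummies = pd.get_dummies(col, prefix=name, prefix_sep=".", dtype=int)
            if dummies.shape[1] == 2:
                # Two levels: keep only the indicator for the second level.
                dummies = dummies.iloc[:, 1:]
            parts.append(dummies)
        elif is_cat:
            # Ordered categorical: 1-based integer codes, like as.numeric() in R.
            parts.append((col.cat.codes + 1).rename(name))
        elif pd.api.types.is_bool_dtype(col):
            # Booleans become 0/1.
            parts.append(col.astype(int))
        else:
            # Numeric and text columns are passed through unchanged.
            parts.append(col)
    return pd.concat(parts, axis=1)


# Hypothetical sample data, loosely mirroring the R example.
df = pd.DataFrame({
    "num": [0.1, 0.2, 0.3, 0.4],
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
    "color": pd.Categorical(["black", "red", "yellow", "black"]),
    "size": pd.Categorical(["small", "big", "small", "big"],
                           categories=["small", "big"], ordered=True),
    "logi": [True, False, True, False],
})
out = make_dummies(df)
print(out)
```

As in the R version, the two-level `sex` column collapses to a single `sex.Male` indicator, while `color` keeps one column for each of its three levels.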