Recoding categorical variables without a value mapping

Date: 2017-08-22 08:40:29

Tags: r

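I have a data frame with a large number of variables (82), many of which are needed for further calculations. So I am trying to convert them to numbers, but it is a huge amount of work to look up the distinct values of each variable and then assign a number to each one.

I would like to know whether there is a more automated way, since I do not care which number is assigned to which value, as long as no number is repeated within a variable.

My approach so far (dummy data for clarity):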

df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
original.var4 = c(10,20,30,40,50,60),
stringsAsFactors = TRUE)  # the default before R 4.0.0; needed to get the factor columns shown below

Given that this works fine:

library(dplyr)
library(magrittr)

     df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))

I tried to adapt it to dplyr and pipes like this:

 df %<>% mutate(VAR1= as.numeric(interaction(original.var1, drop=TRUE))) %>%
            mutate(VAR2= as.numeric(interaction(original.var2, drop=TRUE))) %>%
            mutate(VAR3= as.numeric(interaction(original.var2, drop=TRUE))) 

But the third variable, VAR3, comes out wrong:

    > df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
    # A tibble: 4 x 3
    # Groups:   original.var1 [?]
      original.var1  VAR1     n
             <fctr> <dbl> <int>
    1          disk     1     1
    2       display     2     2
    3        memory     3     2
    4      software     4     1

    > df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
    # A tibble: 2 x 3
    # Groups:   original.var2 [?]
      original.var2  VAR2     n
             <fctr> <dbl> <int>
    1      believer     1     4
    2       skeptic     2     2

    > df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
    # A tibble: 6 x 3
    # Groups:   original.var3 [?]
      original.var3  VAR3     n
             <fctr> <dbl> <int>
    1          cube     1     1
    2       hexagon     1     1
    3         round     2     1
    4        sphere     2     1
    5        square     1     1
    6      triangle     1     1

Is there any method or package for recoding without declaring the mapping beforehand?

2 answers:

Answer 0: (score: 1)

Use purrr to keep only the factor columns and operate on those. Afterwards, combine the result with the numeric columns.

df %>% purrr::keep(is.factor) %>% mutate_all(funs(as.numeric(interaction(., drop = TRUE))))
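The line above returns only the recoded factor columns. A minimal sketch of the final combining step follows. It assumes dplyr 1.0.0 or later (for across()) and uses purrr::discard() plus dplyr::bind_cols() to put the non-factor columns back. The names coded and result, and the shortened data frame, are illustrative only.

```r
library(dplyr)
library(purrr)

# Shortened version of the dummy data; stringsAsFactors = TRUE so that
# the text columns are factors, as in the question.
df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var2 = c("skeptic", "believer", "believer", "believer", "skeptic", "believer"),
  original.var4 = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = TRUE
)

# Step 1: keep only the factor columns and replace each by its level codes.
coded <- df %>%
  keep(is.factor) %>%
  mutate(across(everything(), ~ as.numeric(interaction(.x, drop = TRUE))))

# Step 2: bind the untouched non-factor columns back on.
result <- bind_cols(coded, discard(df, is.factor))
```

Note that with this approach the recoded columns come first and the remaining columns follow, so the original column order is only preserved if the factor columns were already at the front.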

Answer 1: (score: 1)

You can use mutate_if:

library(dplyr)

mutate_if(df, is.factor, funs(as.numeric(interaction(., drop = TRUE))))

which gives,

  original.var1 original.var2 original.var3 original.var4
1             2             2             3            10
2             3             1             5            20
3             4             1             6            30
4             2             1             1            40
5             1             2             4            50
6             3             1             2            60
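In newer dplyr releases funs() is deprecated and mutate_if() is superseded by across(). A sketch of the same recoding in that style follows, assuming dplyr 1.0.0 or later. It uses as.integer(droplevels(.x)) as a stand-in for as.numeric(interaction(., drop = TRUE)); for a single factor both return the level codes after dropping unused levels, although the result is integer rather than double. The name recoded is illustrative.

```r
library(dplyr)

# Dummy data from the question, with text columns stored as factors.
df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var2 = c("skeptic", "believer", "believer", "believer", "skeptic", "believer"),
  original.var3 = c("round", "square", "triangle", "cube", "sphere", "hexagon"),
  original.var4 = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = TRUE
)

# Replace every factor column by its integer level code;
# non-factor columns (original.var4) are left untouched.
recoded <- df %>%
  mutate(across(where(is.factor), ~ as.integer(droplevels(.x))))
```

The codes follow the sorted order of the levels, which is why "disk" becomes 1 and "software" becomes 4 in original.var1.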

Alternatively, you can read your data frame with stringsAsFactors = FALSE and use is.character as the predicate, but it amounts to the same thing.
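A sketch of that character-column variant, assuming dplyr 1.0.0 or later. Each character column is converted with factor() first so that every distinct string gets its own code. The names df_chr and recoded_chr, and the shortened data frame, are illustrative.

```r
library(dplyr)

# Same kind of dummy data, but with text columns kept as character vectors.
df_chr <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var4 = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# factor() builds the levels from the distinct strings;
# as.integer() then extracts the level codes.
recoded_chr <- df_chr %>%
  mutate(across(where(is.character), ~ as.integer(factor(.x))))
```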

To address your comment: if you also want to keep the original columns, then

mutate_if(df, is.factor, funs(new = as.numeric(interaction(., drop = TRUE))))
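With across(), keeping the originals can be sketched through its .names argument, assuming dplyr 1.0.0 or later. The "_new" suffix is chosen here to echo the `new` name used in the call above, and the name kept and the shortened data frame are illustrative.

```r
library(dplyr)

# Shortened dummy data with factor columns.
df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var3 = c("round", "square", "triangle", "cube", "sphere", "hexagon"),
  original.var4 = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = TRUE
)

# .names writes each recoded column under "<original name>_new",
# so the original factor columns stay in place.
kept <- df %>%
  mutate(across(where(is.factor),
                ~ as.integer(droplevels(.x)),
                .names = "{.col}_new"))
```

The recoded columns are appended after the existing ones, so a quick look at the pairs (for example with count(kept, original.var1, original.var1_new)) shows which code was assigned to which value.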