Creating a categorical variable from mutually exclusive dummy variables

Time: 2020-01-27 20:21:11

Tags: r dataframe categorical-data dummy-variable

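How can I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?

Basically, I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).

A base R solution would be appreciated.

For example, I have the following data: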

dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), 
            .Dim = c(10L, 4L), 
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

I would like to collapse this matrix, printed below, into a single categorical variable:

          State.NJ State.NY State.TX State.VA
     [1,]        1        0        0        0
     [2,]        0        1        0        0
     [3,]        1        0        0        0
     [4,]        0        0        0        1
     [5,]        0        1        0        0
     [6,]        0        0        1        0
     [7,]        1        0        0        0
     [8,]        0        0        0        1
     [9,]        0        0        1        0
    [10,]        0        0        0        1

4 answers:

Answer 0 (score: 4)

# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))

df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])

Result:

> df
  a b c cat
1 1 0 0   a
2 0 1 0   b
3 0 0 1   c
4 0 1 0   b
5 0 0 1   c

To generalize this, you need to adapt the df and names(df) parts to your own data. One option is to wrap the logic in a function, e.g.

catmaker <- function(data, varnames, catname) {

  data[,catname] <- apply(data[,varnames], 1, function(i) varnames[which(i == 1)])

  return(data)

}

newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")

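A nice aspect of the function approach is that it is robust to changes in the order of the names in the vector of column names you pass in. That is, varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").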
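
The order claim can be checked directly. The sketch below repeats catmaker and the toy data from above so that it runs on its own; the object names abc and cab are only illustrative.

```r
# Same helper and toy data as above, repeated so the snippet is self-contained
catmaker <- function(data, varnames, catname) {
  data[, catname] <- apply(data[, varnames], 1, function(i) varnames[which(i == 1)])
  data
}

df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))

# Same columns, two different orders
abc <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
cab <- catmaker(data = df, varnames = c("c", "a", "b"), catname = "newcat")

identical(abc$newcat, cab$newcat)  # TRUE
```

This works because each row is matched against varnames in whatever order the columns were selected, so the name that comes back always belongs to the column holding the 1.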

P.S. You added the example data after I posted this. The function works on your example as long as you first convert dummy.df to a data frame, e.g. catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State").
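
Note that catmaker quietly assumes every row has exactly one dummy set to 1. If a row is all zeros or contains several 1s, apply() no longer returns one name per row (typically it returns a list). A minimal sketch of a stricter variant, assuming you would rather get an error in that case; catmaker_checked is a hypothetical name, not part of the answer above:

```r
# The question's data, repeated so the snippet runs on its own
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
            .Dim = c(10L, 4L),
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

# Hypothetical stricter variant of catmaker (assumption: fail loudly on bad rows)
catmaker_checked <- function(data, varnames, catname) {
  # Count how many dummies are set to 1 in each row
  hits <- rowSums(data[, varnames, drop = FALSE] == 1)
  if (any(hits != 1)) {
    stop("rows without exactly one dummy set to 1: ",
         paste(which(hits != 1), collapse = ", "))
  }
  data[[catname]] <- apply(data[, varnames, drop = FALSE], 1,
                           function(i) varnames[which(i == 1)])
  data
}

out <- catmaker_checked(as.data.frame(dummy.df), colnames(dummy.df), "State")
out$State
```

The check costs one extra pass over the data, but it turns a silent change in the shape of the result into an explicit error that names the offending rows.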

Answer 1 (score: 2)

You can use tidyr::pivot_longer (the successor to tidyr::gather):

library(dplyr)
library(tidyr)

as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(-id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)
#> # A tibble: 10 x 1
#>    State
#>    <chr>
#>  1 NJ   
#>  2 NY   
#>  3 NJ   
#>  4 VA   
#>  5 NY   
#>  6 TX   
#>  7 NJ   
#>  8 VA   
#>  9 TX   
#> 10 VA

Answer 2 (score: 2)

For a data frame, you can do this:

states <- names(dummy.df)[max.col(dummy.df)]

Or, since the object in your example is a matrix (where names() returns NULL), use colnames() instead:

states <- colnames(dummy.df)[max.col(dummy.df)]

Then just clean up with sub():

sub(".*\\.", "", states)

"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"

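One detail worth knowing: max.col() breaks ties at random by default (ties.method = "random"). With strictly mutually exclusive dummies there are no ties, but a row of all zeros would be a tie across every column. A minimal sketch, assuming you want to guard against that and end up with a factor; the names idx, state_levels and state are only illustrative:

```r
# The question's data, repeated so the snippet runs on its own
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
            .Dim = c(10L, 4L),
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))

# Guard: every row must contain exactly one 1, otherwise max.col()
# would have to break a tie
stopifnot(all(rowSums(dummy.df == 1) == 1))

# ties.method = "first" makes the call deterministic even without the guard
idx <- max.col(dummy.df, ties.method = "first")
states <- sub(".*\\.", "", colnames(dummy.df)[idx])

# Take the levels from the column names so that a state which never
# occurs in the data is still a level of the factor
state_levels <- sub(".*\\.", "", colnames(dummy.df))
state <- factor(states, levels = state_levels)
state
```
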
Answer 3 (score: 1)

Edit: with your data

One approach, using model.matrix to create the dummies and matrix multiplication to reverse them:

dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 
                      0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 
                      0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
                      ), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", 
                                                  "State.VA")))
level_names <- colnames(dummy.df)

# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)

# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#>  [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#>  [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA

General illustration:

# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
                  y = 1:9)
dfr
#>   vec y
#> 1   a 1
#> 2   a 2
#> 3   a 3
#> 4   b 4
#> 5   b 5
#> 6   b 6
#> 7   c 7
#> 8   c 8
#> 9   c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)

# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)

# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
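
The multiplication works because each row has a single 1, in column j say, so its product with 1:k is simply j. One caveat: factor(res, labels = ...) assumes every code actually occurs in res. If one dummy column is all zeros, there are fewer distinct codes than labels and the call fails. A minimal sketch of a variant that passes the levels explicitly, using the first five rows of the question's data (which contain no State.TX); sub_df and state are illustrative names:

```r
# The question's data, repeated so the snippet runs on its own
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
                        0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
                        0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
            .Dim = c(10L, 4L),
            .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
level_names <- colnames(dummy.df)

# Rows 1 to 5 contain no State.TX, so only 3 of the 4 codes occur
sub_df <- dummy.df[1:5, ]

# Row-wise column index of the single 1
res <- as.integer(sub_df %*% seq_len(ncol(sub_df)))

# Explicit levels keep code j tied to column j, even for unused levels
state <- factor(res, levels = seq_along(level_names), labels = level_names)
state
table(state)
```

With explicit levels, the unused State.TX level is kept and shows up with a count of zero in table(state).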