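How do I create a categorical variable from mutually exclusive dummy variables (taking values 0/1)?
Basically, I am looking for the exact opposite of this solution: (https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781787124479/1/01lvl1sec22/creating-dummies-for-categorical-variables).
A base R solution would be appreciated.
For example, I have the following data: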
dummy.df <- structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L),
.Dim = c(10L, 4L),
.Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX", "State.VA")))
I would like to get the following result:
      State.NJ State.NY State.TX State.VA
 [1,]        1        0        0        0
 [2,]        0        1        0        0
 [3,]        1        0        0        0
 [4,]        0        0        0        1
 [5,]        0        1        0        0
 [6,]        0        0        1        0
 [7,]        1        0        0        0
 [8,]        0        0        0        1
 [9,]        0        0        1        0
[10,]        0        0        0        1
Answer 0 (score: 4)
# toy data
df <- data.frame(a = c(1,0,0,0,0), b = c(0,1,0,1,0), c = c(0,0,1,0,1))
df$cat <- apply(df, 1, function(i) names(df)[which(i == 1)])
Result:
> df
a b c cat
1 1 0 0 a
2 0 1 0 b
3 0 0 1 c
4 0 1 0 b
5 0 0 1 c
To generalize, you'll need to adapt the df and names(df) parts, but you get the idea. One option is to write a function, e.g.:
catmaker <- function(data, varnames, catname) {
data[,catname] <- apply(data[,varnames], 1, function(i) varnames[which(i == 1)])
return(data)
}
newdf <- catmaker(data = df, varnames = c("a", "b", "c"), catname = "newcat")
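A nice aspect of the function approach is that it is robust to the order of the names in the vector of column names passed to it. That is, varnames = c("c", "a", "b") produces the same result as varnames = c("a", "b", "c").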
P.S. You added some example data after I posted this. The function works on your example as long as you first convert dummy.df to a data frame, e.g. catmaker(data = as.data.frame(dummy.df), varnames = colnames(dummy.df), "State").
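One caveat with the apply() approach: if a row contains no 1, or more than one 1, which() returns a vector whose length is not 1, so apply() returns a list instead of a character vector. Below is a minimal sketch of a stricter variant. The name catmaker_strict and the choice to return NA for such rows are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical stricter variant of catmaker(): rows that do not contain
# exactly one 1 get NA instead of changing the shape of the result.
catmaker_strict <- function(data, varnames, catname) {
  data[[catname]] <- apply(data[, varnames, drop = FALSE], 1, function(i) {
    hit <- which(i == 1)  # positions of the 1s in this row
    if (length(hit) == 1) varnames[hit] else NA_character_
  })
  data
}

# Toy data: row 3 has no 1, row 4 has two 1s
df <- data.frame(a = c(1, 0, 0, 1), b = c(0, 1, 0, 1), c = c(0, 0, 0, 0))
out <- catmaker_strict(df, varnames = c("a", "b", "c"), catname = "cat")
out$cat
```

Whether NA, an error, or a combined label is the right treatment for such rows depends on the data.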
Answer 1 (score: 2)
You can use tidyr::pivot_longer (which supersedes tidyr::gather):
library(dplyr)
library(tidyr)
as_tibble(dummy.df) %>%
  mutate(id = 1:n()) %>%
  pivot_longer(-id, values_to = "Value",
               names_to = c("txt", "State"), names_sep = "\\.") %>%
  filter(Value == 1) %>%
  select(State)
#> # A tibble: 10 x 1
#> State
#> <chr>
#> 1 NJ
#> 2 NY
#> 3 NJ
#> 4 VA
#> 5 NY
#> 6 TX
#> 7 NJ
#> 8 VA
#> 9 TX
#> 10 VA
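Answer 2 (score: 2)
For a data frame, you can do this:
states <- names(dummy.df)[max.col(dummy.df)]
If, as in your example, it is a matrix, use colnames() instead, since names() returns NULL for a matrix:
states <- colnames(dummy.df)[max.col(dummy.df)]
Then just clean up with sub():
sub(".*\\.", "", states)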
"NJ" "NY" "NJ" "VA" "NY" "TX" "NJ" "VA" "TX" "VA"
Answer 3 (score: 1)
Edit: with your data
One approach, using model.matrix for dummy creation and matrix multiplication:
dummy.df<-structure(c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L), .Dim = c(10L, 4L
), .Dimnames = list(NULL, c("State.NJ", "State.NY", "State.TX",
"State.VA")))
level_names <- colnames(dummy.df)
# use matrix multiplication to extract wanted level
res <- dummy.df%*%1:ncol(dummy.df)
# clean up
res <- as.numeric(res)
factor(res, labels = level_names)
#> [1] State.NJ State.NY State.NJ State.VA State.NY State.TX State.NJ
#> [8] State.VA State.TX State.VA
#> Levels: State.NJ State.NY State.TX State.VA
General illustration:
# create factor and dummy target y
dfr <- data.frame(vec = gl(n = 3, k = 3, labels = letters[1:3]),
y = 1:9)
dfr
#> vec y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 b 4
#> 5 b 5
#> 6 b 6
#> 7 c 7
#> 8 c 8
#> 9 c 9
# dummies creation
dfr_dummy <- model.matrix(y ~ 0 + vec, data = dfr)
# use matrix multiplication to extract wanted level
res <- dfr_dummy%*%c(1,2,3)
# clean up
res <- as.numeric(res)
factor(res, labels = letters[1:3])
#> [1] a a a b b b c c c
#> Levels: a b c
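The matrix product works because each row holds a single 1, so multiplying by 1:k returns the column index of that 1. One thing to watch: factor(res, labels = ...) needs every level to occur in the data, otherwise the number of labels no longer matches the number of distinct values. A small sketch that passes levels explicitly to avoid this; the toy matrix m is made up for illustration.

```r
# Toy dummy matrix in which level "c" never occurs
m <- rbind(c(0, 1, 0),
           c(1, 0, 0),
           c(0, 1, 0))
colnames(m) <- c("a", "b", "c")

# Each row has one 1, so the product is the column index of that 1
idx <- as.vector(m %*% seq_len(ncol(m)))

# Explicit levels keep the unused level "c" and keep labels aligned
f <- factor(idx, levels = seq_len(ncol(m)), labels = colnames(m))
f
```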