我有这种数据。
library(dplyr)
glimpse(samp)
Observations: 5
Variables: 5
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol <fct> full_bar, NA, full_bar, beer_and_wi...
$ BikeParking <fct> True, NA, False, NA, NA
$ BusinessAcceptsBitcoin <fct> NA, NA, NA, NA, NA
$ BusinessAcceptsCreditCards <fct> True, NA, NA, True, True
我想创建1-p虚拟功能。 createDummyFeatures
软件包的mlr
函数可以选择reference
来完成此操作。
library(mlr)
dummy = createDummyFeatures(samp, target = "review_count", method = "reference")
问题是它没有保留原始的列名。
glimpse(dummy)
Observations: 5
Variables: 6
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol.full_bar <dbl> 1, NA, 1, 0, NA
$ Alcohol.none <dbl> 0, NA, 0, 0, NA
$ True <dbl> 1, NA, 0, NA, NA
$ True.1 <dbl> NA, NA, NA, NA, NA
$ True.2 <dbl> 1, NA, NA, 1, 1
问题是我如何保留它们?
一个想法是通过1-of-n
方法创建它们,然后删除所有包含“ False”的列。
dummy2 = createDummyFeatures(samp, target = "review_count")
dummy2 = dummy2 %>%
select(-contains("False"))
glimpse(dummy2)
Observations: 5
Variables: 7
$ review_count <int> 68, 3, 7, 9, 5
$ Alcohol.beer_and_wine <dbl> 0, NA, 0, 1, NA
$ Alcohol.full_bar <dbl> 1, NA, 1, 0, NA
$ Alcohol.none <dbl> 0, NA, 0, 0, NA
$ BikeParking.True <dbl> 1, NA, 0, NA, NA
$ BusinessAcceptsBitcoin.True <dbl> NA, NA, NA, NA, NA
$ BusinessAcceptsCreditCards.True <dbl> 1, NA, NA, 1, 1
但是,我不知道它是否与n-1相同,尤其是对于具有2个以上级别的因子而言(虚拟编码用于XGBoost回归,其中“评论计数”是目标变量)。
dput(samp)
structure(list(review_count = c(68L, 3L, 7L, 9L, 5L), Alcohol = structure(c(2L,
NA, 2L, 1L, NA), .Label = c("beer_and_wine", "full_bar", "none"
), class = "factor"), BikeParking = structure(c(2L, NA, 1L, NA,
NA), .Label = c("False", "True"), class = "factor"), BusinessAcceptsBitcoin = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = c("False",
"True"), class = "factor"), BusinessAcceptsCreditCards = structure(c(2L,
NA, NA, 2L, 2L), .Label = c("False", "True"), class = "factor")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -5L))
修改
对于那些有相同问题的人,我使用caret
解决了这个问题。
library(caret)
dummy_dat = dummyVars("~ .", data = samp, fullRank = T)
dat = data.frame(predict(dummy_dat, newdata = samp))