我想通过r中的一个热编码将因子变量表示为0和1值作为data.frame。
在因子变量中,我想仅对具有三个或更多级别的变量执行一个热编码。
这是我的R代码。
german<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) german[,i]=as.factor(german[,i])
str(german)
'data.frame': 1000 obs. of 21 variables:
$ Creditability : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
$ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
$ Value.Savings.Stocks : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 1 1 1 1 1 1 3 ...
$ Length.of.current.employment : Factor w/ 5 levels "1","2","3","4",..: 2 3 4 3 3 2 4 2 1 1 ...
$ Instalment.per.cent : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 4 1 1 2 4 1 ...
$ Sex...Marital.Status : Factor w/ 4 levels "1","2","3","4": 2 3 2 3 3 3 3 3 2 2 ...
$ Guarantors : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Duration.in.Current.address : Factor w/ 4 levels "1","2","3","4": 4 2 4 2 4 3 4 4 4 4 ...
$ Most.valuable.available.asset : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 3 4 ...
$ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
$ Concurrent.Credits : Factor w/ 3 levels "1","2","3": 3 3 3 3 1 3 3 3 3 3 ...
$ Type.of.apartment : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 1 2 2 2 1 ...
$ No.of.Credits.at.this.Bank : Factor w/ 4 levels "1","2","3","4": 1 2 1 2 2 2 2 1 2 1 ...
$ Occupation : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 2 2 2 1 1 ...
$ No.of.dependents : Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 1 ...
$ Telephone : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ Foreign.Worker : Factor w/ 2 levels "1","2": 1 1 1 2 2 2 2 2 1 1 ...
在这里,我想对一个超过3个级别的因子变量进行热编码。
例如,Guarantors变量有3个级别1,2,3。 因此,我想获得Guarantors1,Guarantors2和Guarantors3变量,这些变量只有0.1值作为data.frame。
答案 0 :(得分:2)
dplyr
&amp; purrr
方法
library(dplyr)
library(purrr)
german<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
cols <- c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
map_df(german[, cols], as.factor) %>%
select_if(function(x) nlevels(x) >= 2) %>%
model.matrix(~. -1, data = .) %>%
as.data.frame()
我建议您阅读帮助model.matrix
或other questions from SO on this topic.