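Convert column values into their own binary-encoded columns (dummy variables)

Time: 2015-05-16 20:56:19

Tags: r sparse-matrix reshape2

I have a number of CSV files containing columns such as gender, age, diagnosis, and so on.

Currently, they are encoded like this: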

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

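The goal is to transform this data into the target format shown below.

Side note: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female'.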

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1 

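I have tried reshape2's dcast() function, but I get combinations of the columns, which results in extremely sparse matrices. Here is a simplified example with only age and gender: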

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1   

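Since this is a fairly common task in data preparation for machine learning, I imagine there may be other libraries (that I am unaware of) able to perform this transformation.

5 answers:

Answer 0: (score: 8)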

You need a melt/dcast combination (which is called recast) to convert all the columns into a single column and avoid the combinations

library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0

Per your side note, you can add variable to the formula here so that the original column names are added as well
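As a minimal sketch of that variant, assuming df holds the question's data (it is rebuilt here so the block runs on its own), variable can be placed in front of value in the formula. melt() stores the original column name in variable, so the new columns come out as name_value. The helper name has_any is introduced here for illustration only.

```r
library(reshape2)

# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 if the ID has at least one row for this
# variable/value pair, otherwise 0
has_any <- function(x) (length(x) > 0) + 0L

# After melting, `variable` holds the original column name, so
# ID ~ variable + value yields columns such as gender_female or age_42
res <- recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = has_any)
names(res)
```

Because only the variable/value pairs that actually occur are kept, this avoids the gender-by-age cross product from the question.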

Answer 1: (score: 5)

The caret package has a function to "dummify" data.

library(caret)
library(dplyr)
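# Note: mutate_each() and funs() are deprecated in newer dplyr releases;
# mutate(df, across(everything(), as.factor)) is the current equivalent.
predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)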

Answer 2: (score: 4)

A base R option

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
 #     values
 #ID  19 23 42 61 anxiety asthma copd diabetes female male
 # 1  0  0  1  0       1      1    0        0      0    1
 # 2  1  0  0  0       0      1    0        0      0    1
 # 3  0  1  0  0       0      0    0        1      1    0
 # 4  0  0  0  1       0      0    1        1      1    0

If you also need the original names

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
 #   Val
 #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
 #1      0      0      1      0                 1                1
 #2      1      0      0      0                 0                1
 #3      0      1      0      0                 0                0
 #4      0      0      0      1                 0                0
 #  Val
 #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
 #1              0                  0             0           1
 #2              0                  0             0           1
 #3              0                  1             1           0
 #4              1                  1             1           0
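The one-liners above pack several steps together. Unrolled as a sketch, the first one looks roughly like this; the intermediate names long, tab and ind are introduced here for illustration only, and the ID column is repeated explicitly instead of relying on cbind() recycling it.

```r
# Same data as df1, with character (not factor) columns so stack() keeps them
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: stack every non-ID column into one long `values` column,
# repeating the IDs once per stacked column
long <- data.frame(
  ID     = rep(df1$ID, times = ncol(df1) - 1),
  values = stack(df1[-1])$values
)

# Step 2: cross-tabulate ID against value; cells are counts, so an ID
# that spans several rows can have counts above 1
tab <- table(long)

# Step 3: turn counts into 0/1 indicators (same effect as (!!tab) * 1L)
ind <- (tab > 0) * 1L
ind
```

The thresholding in step 3 matters because IDs 1 and 4 appear on two rows each, so their gender and age counts would otherwise be 2.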

Data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
"male", "male", "female", "female", "female"), age = c(42L, 42L, 
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")

Answer 3: (score: 3)

Using reshape from base R

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")


new.dat <- cbind(ID=d["ID"],
    g[,grepl("_", names(g))],
    a[,grepl("_", names(a))],
    d[,grepl("_", names(d))])

# convert factor columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)

new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1

#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1

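Answer 4: (score: 1)

Here is a slightly longer way using dcast() and merge(). Since gender and age are not unique per id, a function (dum()) is created to convert their lengths into dummy variables. Diagnosis, on the other hand, is counted per unique id by adjusting the formula.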

library(reshape2)
data.raw <- read.table(header = TRUE, sep = ",", text = "
id, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd")

# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }

# length of diagnosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]

# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)

# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)

merge(merge(gen, age, by = "id"), diag, by = "id")
#  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes
#1  1        0      1  0  0  1  0         1        1      0          0
#2  2        0      1  1  0  0  0         0        1      0          0
#3  3        1      0  0  1  0  0         0        0      0          1
#4  4        1      0  0  0  0  1         0        0      1          1

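Actually, I don't know your model well, but your setup may be overkill, since R handles factors itself through formula objects. For example, if gender is the response, the following matrix is generated within R. So, as long as you are not doing the fitting yourself, it is enough to set the data types and the formula appropriately.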

data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes
#1           1     0     1     0                 1               0                   0
#2           1     0     1     0                 0               0                   0
#3           1     0     0     0                 1               0                   0
#4           1     1     0     0                 0               0                   1
#5           1     0     0     1                 0               0                   1
#6           1     0     0     1                 0               1                   0

If you need all levels of each variable, you can do this by suppressing the intercept in model.matrix and using a little magic from all-levels-of-a-factor-in-a-model-matrix-in-r

#  Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)

# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1, 
             contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))

# Collapse to give final solution
aggregate(. ~ ID, data=m, max)
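To see why the contrasts.arg step is needed, here is a small sketch comparing the two codings on the same data. It assumes df1 as defined in the earlier answer (rebuilt here so the block runs on its own); the names default_mm, full_contrasts, full_mm and collapsed are illustrative. Dropping the intercept alone gives full dummy coding only to the first factor, while the remaining factors still lose their first level.

```r
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)
df1[-1] <- lapply(df1[-1], factor)

# No intercept, default contrasts: only gender keeps both levels;
# age and diagnosis each drop their first level
default_mm <- model.matrix(~ 0 + ., data = df1[-1])

# contrasts(x, contrasts = FALSE) returns an identity matrix per factor,
# which asks model.matrix() for one indicator column per level
full_contrasts <- lapply(df1[-1], contrasts, contrasts = FALSE)
full_mm <- model.matrix(~ 0 + ., data = df1[-1], contrasts.arg = full_contrasts)

# Columns that only appear with the full coding
setdiff(colnames(full_mm), colnames(default_mm))

# Collapse the per-row indicators to one row per ID
collapsed <- aggregate(as.data.frame(full_mm), by = df1["ID"], FUN = max)
collapsed
```

Taking max within each ID works because the indicators are 0/1, so the result is 1 whenever any of that ID's rows has the level.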