I have many CSV files containing columns such as gender, age, and diagnosis.
Currently, they are encoded as follows:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
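The goal is to convert this data into the following target format:
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
Side note: if possible, it would be even better to also include the original column name in the new column names, e.g. 'age_42' or 'gender_female'.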
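I tried the dcast() function from reshape2, but it produces combinations of the variables, which results in a very sparse matrix. Here is a simplified example with only age and gender: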
data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)
ID, male19, male23, male42, male61, female19, female23, female42, female61
1, 0, 0, 1, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 1
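Since this is a fairly common task in data preparation for machine learning, I imagine there are other libraries (unknown to me) that can perform this transformation.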
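To illustrate why the crossed formula gets so wide, here is a minimal base R sketch. It assumes a data frame named df that mirrors the example rows above (without the diagnosis column) and is not part of the original question. Crossing two variables yields one column per pair of values, while separate indicator columns only need one column per distinct value.

```r
# Example data assumed to match the rows shown in the question
df <- data.frame(
  ID     = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age    = c(42L, 42L, 19L, 23L, 61L, 61L),
  stringsAsFactors = FALSE
)

# Crossing gender and age creates one column per (gender, age) pair
crossed <- table(df$ID, interaction(df$gender, df$age))

# Building the indicators per variable needs one column per distinct value
separate <- cbind(table(df$ID, df$gender), table(df$ID, df$age))

ncol(crossed)   # 8 columns: 2 genders x 4 ages
ncol(separate)  # 6 columns: 2 genders + 4 ages
```

With more variables the crossed layout grows multiplicatively, which is where the sparsity comes from.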
Answer 0 (score: 8)
You need a melt / dcast combination (which is called recast) to convert all the columns into a single column and avoid the combinations:
library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
# ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1 1 0 0 1 0 1 1 0 0 0 1
# 2 2 1 0 0 0 0 1 0 0 0 1
# 3 3 0 1 0 0 0 0 0 1 1 0
# 4 4 0 0 0 1 0 0 1 1 1 0
As per your side note, you can add variable here so that the original column names are added as well.
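A sketch of what adding variable could look like, assuming it goes on the right-hand side of the recast formula next to value. The exact call is not shown in the text above, so treat this as one plausible reading. The data frame df is assumed to match the example in the question.

```r
library(reshape2)

# Example data assumed to match the question
df <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# `variable` holds the source column name after melting, so putting it in the
# formula prefixes every value with the column it came from (e.g. age_42).
# melt() may warn that the measured columns have different types; the values
# are coerced to character, which is what we want for column labels.
res <- recast(df, ID ~ variable + value, id.var = 1,
              fun.aggregate = function(x) (length(x) > 0) + 0L)

names(res)
```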
Answer 1 (score: 5)
The caret package has a function to "dummify" data.
library(caret)
library(dplyr)
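# note: mutate_each() and funs() are deprecated in recent dplyr versions;
# mutate(df, across(everything(), as.factor)) is the current equivalent
predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)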
Answer 2 (score: 4)
A base R option:
(!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
# values
#ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1 0 0 1 0 1 1 0 0 0 1
# 2 1 0 0 0 0 1 0 0 0 1
# 3 0 1 0 0 0 0 0 1 1 0
# 4 0 0 0 1 0 0 1 1 1 0
If you also need the original names:
(!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
# Val
#ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
#1 0 0 1 0 1 1
#2 1 0 0 0 0 1
#3 0 1 0 0 0 0
#4 0 0 0 1 0 0
# Val
#ID diagnosis_copd diagnosis_diabetes gender_female gender_male
#1 0 0 0 1
#2 0 0 0 1
#3 0 1 1 0
#4 1 1 1 0
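The first one-liner above is dense, so here is a step-by-step sketch of what it appears to do. The intermediate names (long, id_value, counts, dummies) are hypothetical and only used for illustration. The data frame df1 is rebuilt here so the block runs on its own.

```r
# Same data as df1 in this answer, rebuilt with data.frame() for readability
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: stack() puts every non-ID column into one long "values" column
# (plus an "ind" column naming the source column)
long <- stack(df1[-1])

# Step 2: pair each value with its ID; the 6 IDs are recycled across the
# 18 stacked rows, and "ind" is dropped
id_value <- cbind(df1[1], long[-2])

# Step 3: cross-tabulate ID against value; counts can exceed 1
# (ID 1 has two rows, so "male" is counted twice)
counts <- table(id_value)

# Step 4: !! turns counts into TRUE/FALSE, and * 1L turns that into 0/1
dummies <- (!!counts) * 1L
dummies
```

The double negation is what keeps repeated rows for the same ID from showing up as 2 instead of 1.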
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male",
"male", "male", "female", "female", "female"), age = c(42L, 42L,
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma",
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender",
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
Answer 3 (score: 3)
Using reshape from base R:
d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")
new.dat <- cbind(ID=d["ID"],
g[,grepl("_", names(g))],
a[,grepl("_", names(a))],
d[,grepl("_", names(d))])
# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)
new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1
# ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1 1 1 0 1 0 0 0 1
#3 2 1 0 0 1 0 0 1
#4 3 0 1 0 0 1 0 0
#5 4 0 1 0 0 0 1 0
# diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1 1 0 0
#3 0 0 0
#4 0 1 0
#5 0 1 1
Answer 4 (score: 1)
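Here is a somewhat longer way using dcast() and merge(). Since gender and age are not unique within an ID, a function (dum()) is created to convert their length into a dummy variable. Diagnosis, on the other hand, is made a unique count by adjusting the formula.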
library(reshape2)
data.raw <- read.table(header = T, sep = ",", text = "
id, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd")
# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }
# length of diagnosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]
# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)
# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)
merge(merge(gen, age, by = "id"), diag, by = "id")
# id female male 19 23 42 61 anxiety asthma copd diabetes
#1 1 0 1 0 0 1 0 1 1 0 0
#2 2 0 1 1 0 0 0 0 1 0 0
#3 3 1 0 0 1 0 0 0 0 0 1
#4 4 1 0 0 0 0 1 0 0 1 1
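I don't actually know your model well, but your setup may be more than you need, because R handles factors through the formula object. For example, if gender is the response, the following matrix is generated in R. So, as long as you are not doing the fitting yourself, it is enough to set the data types and the formula appropriately.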
data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis asthma diagnosis copd diagnosis diabetes
#1 1 0 1 0 1 0 0
#2 1 0 1 0 0 0 0
#3 1 0 0 0 1 0 0
#4 1 1 0 0 0 0 1
#5 1 0 0 1 0 0 1
#6 1 0 0 1 0 1 0
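The matrix above has no age19 column because, with an intercept and R's default treatment contrasts, the first level of each factor becomes the reference level. A minimal sketch with a toy factor (the vector below is an assumption that copies the ages from the example):

```r
# Toy factor using the ages from the example data
age <- factor(c(42, 42, 19, 23, 61, 61))

# Default coding: an intercept plus one column per non-reference level
mm <- model.matrix(~ age)

colnames(mm)    # "(Intercept)" "age23" "age42" "age61"
levels(age)[1]  # "19", the reference level absorbed by the intercept
```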
If you need all the levels of every variable, you can get them by suppressing the intercept in model.matrix and using a little magic from all-levels-of-a-factor-in-a-model-matrix-in-r:
# Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)
# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1,
contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))
# Collapse to give final solution
aggregate(. ~ ID, data=m, max)
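To see what the contrasts.arg trick changes, the following sketch builds the matrix with and without it and compares the column names. The names m_default, m_full and collapsed are hypothetical, and df1 is rebuilt here so the block is self-contained.

```r
# Same example data as df1 above
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender    = c("male", "male", "male", "female", "female", "female"),
  age       = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)
df1[-1] <- lapply(df1[-1], factor)

# Without contrasts.arg: some factors still lose their reference level
m_default <- model.matrix(~ 0 + ., data = df1)

# contrasts(f, contrasts = FALSE) returns an identity matrix, so every
# level of every factor gets its own indicator column
m_full <- model.matrix(~ 0 + ., data = df1,
                       contrasts.arg = lapply(df1[-1], contrasts, contrasts = FALSE))

# Levels that only appear when the identity contrasts are supplied
setdiff(colnames(m_full), colnames(m_default))

# One row per ID, taking the maximum of each indicator
collapsed <- aggregate(. ~ ID, data = data.frame(m_full), max)
```

Suppressing the intercept alone is not enough here, which is why the identity contrasts are passed for every factor.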