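I have a data frame of survey responses, and some of the columns are questions where participants could select more than one answer ("select all that apply").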
> age <- c(24, 28, 44, 55, 53)
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
> ethnicity_other <- c(NA, NA, "luvale", NA, NA)
> df <- data.frame(age, ethnicity, ethnicity_other)
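I would like to turn these questions into binary response items, so that each response option (in this example, the values found in ethnicity and ethnicity_other) becomes its own column vector containing a 0 or a 1.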
So far, I have written a script that splits out each unique response into a list (z):
> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")), mode="list"))
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list"))
>
> combine <- c(x, y)
>
> z <- NULL
> for(i in combine){
> if(!is.na(i)){
> z <- append(z, i)
> }
> }
Then I created new columns from that list and filled them with NA values.
> for(elm in z){
> df[paste0("ethnicity_",elm)] <- NA
> }
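So now I have 35 additional columns that I want to fill with 1s and 0s, depending on whether the column name (or rather part of it, since I used ethnicity_ as a prefix) can be found in the corresponding cell under ethnicity or ethnicity_other.
I have tried several approaches without arriving at a good solution.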
Answer 0: (score: 0)
Here is how I would do it:
First, you need something that stores each participant's ethnicities. My approach is to build a list like this:
ethnicities = sapply(X=df$ethnicity, FUN=function(response) {return (strsplit(as.character(response), " "))} )
For your particular example, this gives:
> ethnicities
[[1]]
[1] "ngoni"
[[2]]
[1] "bemba"
[[3]]
[1] "lozi" "tonga"
[[4]]
[1] "bemba" "tonga" "other"
[[5]]
[1] "bemba" "tongi"
Then iterate over these to fill in your data.frame df:
for (i in seq_along(ethnicities)) {
for (eth in ethnicities[[i]]) {
df[[paste0('ethnicity_',eth)]][i]=1
}
}
The resulting df should be:
> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1
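There are other ways to do this. You could also wrap the two for loops in sapply calls, but I don't think the resulting code would be any more efficient (and it would be harder to read!).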
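For comparison, here is a minimal sketch of what such an sapply version might look like. It rebuilds the example data with stringsAsFactors = FALSE, covers only the ethnicity column (as the list above does), and writes 0/1 directly instead of NA/1. The names choices, indicators and result, and the eth_ prefix, are my own assumptions for illustration; the different prefix is there so that the "other" option does not collide with the existing ethnicity_other column.

```r
# Rebuild the example data; keep the answers as character, not factor
age <- c(24, 28, 44, 55, 53)
ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
ethnicity_other <- c(NA, NA, "luvale", NA, NA)
df <- data.frame(age, ethnicity, ethnicity_other, stringsAsFactors = FALSE)

# One character vector of selected options per participant
ethnicities <- strsplit(df$ethnicity, " ")

# Every distinct option, in order of first appearance
choices <- unique(unlist(ethnicities))

# For each option, flag the participants who selected it (1) or not (0);
# the outer sapply returns a matrix with one column per option
indicators <- sapply(choices, function(opt) {
  as.integer(sapply(ethnicities, function(sel) opt %in% sel))
})

# Hypothetical prefix, chosen to avoid clashing with ethnicity_other
colnames(indicators) <- paste0("eth_", choices)

result <- cbind(df, indicators)
```

Whether this reads better than the explicit loops is a matter of taste; the loops make the row-by-row filling more obvious.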
Does this help?
Edit
By the way, if you really want 0 instead of NA in the data.frame, it is as simple as changing the code that initializes the added columns:
> for(elm in z){
> df[paste0("ethnicity_",elm)] <- 0 # instead of NA
> }
Answer 1: (score: 0)
Here are a couple of ways to do this, using either plyr or data.table.
# as.character() guards against the columns being factors
all_ethnicities <- unique(c(
  unlist(strsplit(as.character(df$ethnicity), " ")),
  unlist(strsplit(as.character(df$ethnicity_other), " "))
))
df$id <- 1:nrow(df)
library(plyr)
ddply(df, .(id), function(x)
  table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
               levels = all_ethnicities)))
##   id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0
library(data.table)
DT <- data.table(df)
DT[, {
  as.list(
    table(
      factor(
        unlist(strsplit(paste(ethnicity, ethnicity_other), " ")),
        levels = all_ethnicities)
    )
  )
}, by = id]
##    id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
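Both versions rely on the same idea: tabulating a factor whose levels are fixed in advance, so that options a participant did not pick still appear with a count of 0. A minimal base R sketch for a single participant (row 3 of the example) may make this clearer. The vector of levels is typed out by hand here, and the names answers and counts are assumptions made for illustration only.

```r
# All options, written out by hand in the same order as the output above
all_ethnicities <- c("ngoni", "bemba", "lozi", "tonga", "other", "tongi", "luvale")

# Row 3 of the example: ethnicity is "lozi tonga", ethnicity_other is "luvale"
answers <- unlist(strsplit(paste("lozi tonga", "luvale"), " "))

# Because the levels are fixed, unselected options are counted as 0
# instead of being left out of the table
counts <- table(factor(answers, levels = all_ethnicities))
counts
```

Anything that is not among the levels (for example the string "NA" that paste() produces from a missing ethnicity_other) becomes a missing factor value and is not counted by table() under its default settings.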
Answer 2: (score: 0)
Here is an approach that uses concat.split.expanded from my "splitstackshape" package:
## Combine your "ethnicity" and "ethnicity_other" column
df$ethnicity <- paste(df$ethnicity,
                      ifelse(is.na(df$ethnicity_other), "",
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL
## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ",
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1
The fill argument can easily be set to any other value you want. The default is NA, but here I have set it to 0, since I assume that is what you are looking for.