Question

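I have a data frame of survey responses, and some of the columns are questions where participants could select more than one answer ("select all that apply").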

> age <- c(24, 28, 44, 55, 53)
> ethnicity <- c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi")
> ethnicity_other <- c(NA, NA, "luvale", NA, NA) 
> df <- data.frame(age, ethnicity, ethnicity_other)

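I would like to turn these questions into binary response items, so that each response option (here, the values found in ethnicity and ethnicity_other) becomes its own column vector containing either 0 or 1.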

So far, I have written a script that splits the unique responses out into a list (z):

> x <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity_other), " ")),    mode="list"))
> y <- unique(as.vector(unlist(strsplit(as.character(df$ethnicity), " ")), mode="list"))
>
> combine <- c(x, y)
>
> z <- NULL
> for(i in combine){
> if(!is.na(i)){
> z <- append(z, i)
>   }   
> }

Then I created new columns from that list and filled them with NA values.

> for(elm in z){
>   df[paste0("ethnicity_",elm)]  <- NA
> }

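So now I have 35 additional columns that I would like to fill with 1s and 0s, depending on whether the column name (or the relevant part of it, since I used ethnicity_ as a prefix) can be found in the corresponding cell under ethnicity or ethnicity_other. I have tried a number of approaches without finding a good solution.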

Answer 1

Here is how I would do it:

First, you need something that stores the ethnicities of each participant. My approach would be to build a list like this:

ethnicities = sapply(X=df$ethnicity, FUN=function(response) {return (strsplit(as.character(response), " "))} )

For your particular example, this gives:

> ethnicities
[[1]]
[1] "ngoni"

[[2]]
[1] "bemba"

[[3]]
[1] "lozi"  "tonga"

[[4]]
[1] "bemba" "tonga" "other"

[[5]]
[1] "bemba" "tongi"

Then, iterate over these to fill in your data.frame df:

for (i in seq_along(ethnicities)) {
  for (eth in ethnicities[[i]]) {
    df[[paste0('ethnicity_',eth)]][i]=1
  }
}

The resulting value of df should be:

> df
  age         ethnicity ethnicity_other ethnicity_luvale ethnicity_ngoni ethnicity_bemba
1  24             ngoni              NA               NA               1              NA
2  28             bemba              NA               NA              NA               1
3  44        lozi tonga              NA               NA              NA              NA
4  55 bemba tonga other               1               NA              NA               1
5  53       bemba tongi              NA               NA              NA               1
  ethnicity_lozi ethnicity_tonga ethnicity_tongi
1             NA              NA              NA
2             NA              NA              NA
3              1               1              NA
4             NA               1              NA
5             NA              NA               1

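There are other ways to do this. You could also wrap the two for loops in sapply calls, but I don't think the resulting code would be any more efficient (and it would be harder to read!).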
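As one illustration of such an alternative, here is a minimal base R sketch that avoids the explicit loops by using matrix indexing. It is not the code from this answer: it assumes that both ethnicity and ethnicity_other should be combined, and it uses a hypothetical eth_ prefix so that the indicator for "other" does not collide with the existing ethnicity_other column. The names responses, choices, row_id and indicators are also made up for the example.

```r
# Sample data from the question, with text columns kept as character
df <- data.frame(
  age = c(24, 28, 44, 55, 53),
  ethnicity = c("ngoni", "bemba", "lozi tonga", "bemba tonga other", "bemba tongi"),
  ethnicity_other = c(NA, NA, "luvale", NA, NA),
  stringsAsFactors = FALSE
)

# One character vector of selected options per respondent, with NA dropped
responses <- Map(
  function(main, other) {
    parts <- unlist(strsplit(c(main, other), " "))
    parts[!is.na(parts)]
  },
  df$ethnicity, df$ethnicity_other
)

# Every distinct option, in order of first appearance
choices <- unique(unlist(responses, use.names = FALSE))

# Start from an all-zero matrix: one row per respondent, one column per option
# ("eth_" is a hypothetical prefix chosen to avoid clashing with ethnicity_other)
indicators <- matrix(
  0L, nrow = nrow(df), ncol = length(choices),
  dimnames = list(NULL, paste0("eth_", choices))
)

# Set the (row, column) pairs that were actually selected to 1
row_id <- rep(seq_along(responses), lengths(responses))
col_id <- match(unlist(responses, use.names = FALSE), choices)
indicators[cbind(row_id, col_id)] <- 1L

result <- cbind(df, indicators)
```

Because the matrix starts out filled with 0, this version never produces NA in the indicator columns.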

Does this help?

Edit

By the way, if you really want 0 rather than NA in the data.frame, it is as simple as changing the code that initializes the added columns:

> for(elm in z){
>   df[paste0("ethnicity_",elm)] <- 0 # instead of NA
> }
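If the columns have already been filled and still contain NA, the remaining NA values can also be replaced afterwards. The following is only a sketch on a small made-up data frame: it assumes the indicator columns are exactly those whose names start with ethnicity_, apart from the original free-text column ethnicity_other, which is excluded by name. The variable ind_cols is introduced for the example.

```r
# Small made-up data frame standing in for the filled-in survey data
df <- data.frame(
  age = c(24, 28),
  ethnicity_other = c("luvale", "luvale"),
  ethnicity_ngoni = c(1, NA),
  ethnicity_bemba = c(NA, 1),
  stringsAsFactors = FALSE
)

# Indicator columns: the prefixed ones, minus the original free-text column
ind_cols <- setdiff(grep("^ethnicity_", names(df), value = TRUE), "ethnicity_other")

# Replace every NA in those columns with 0, leaving other columns untouched
df[ind_cols][is.na(df[ind_cols])] <- 0
```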

Answer 2

Here are a couple of ways to do this using plyr or data.table.

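## as.character() guards against factor columns, which strsplit() rejects
all_ethnicities <- unique(c(
    unlist(strsplit(as.character(df$ethnicity), " ")),
    unlist(strsplit(as.character(df$ethnicity_other), " "))
    ))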

df$id <- 1:nrow(df)

library(plyr)

ddply(df, .(id), function(x)
      table(factor(unlist(strsplit(paste(x$ethnicity, x$ethnicity_other), " ")),
                   levels = all_ethnicities)))

##    id ngoni bemba lozi tonga other tongi luvale
## 1  1     1     0    0     0     0     0      0
## 2  2     0     1    0     0     0     0      0
## 3  3     0     0    1     1     0     0      1
## 4  4     0     1    0     1     1     0      0
## 5  5     0     1    0     0     0     1      0

library(data.table)

DT <- data.table(df)

DT[, {
    as.list(
        table(
            factor(
                unlist(strsplit(paste(ethnicity, ethnicity_other),  " ")),
                levels = all_ethnicities)
            )
        )
}, by = id]

##     id ngoni bemba lozi tonga other tongi luvale
## 1:  1     1     0    0     0     0     0      0
## 2:  2     0     1    0     0     0     0      0
## 3:  3     0     0    1     1     0     0      1
## 4:  4     0     1    0     1     1     0      0
## 5:  5     0     1    0     0     0     1      0
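To make the per-row step in both versions easier to follow, here is a small base R sketch of what happens for a single respondent (row 3 of the example). The level vector is typed out by hand to match the column order shown above, and the names tokens, counts and counts_na are made up for the illustration.

```r
# Levels in the same order as the output columns above
all_ethnicities <- c("ngoni", "bemba", "lozi", "tonga", "other", "tongi", "luvale")

# Row 3: ethnicity is "lozi tonga", ethnicity_other is "luvale"
tokens <- unlist(strsplit(paste("lozi tonga", "luvale"), " "))

# factor() with fixed levels makes table() report every option, including zeros
counts <- table(factor(tokens, levels = all_ethnicities))
as.list(counts)

# When ethnicity_other is NA, paste() yields the string "ngoni NA"; the token
# "NA" is not among the levels, so it becomes a missing value that table() drops
counts_na <- table(factor(unlist(strsplit(paste("ngoni", NA), " ")),
                          levels = all_ethnicities))
```

Fixing the levels up front is what guarantees that every group returns the same set of columns, which both ddply and data.table need in order to stack the per-row results.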

Answer 3

Here is an approach using concat.split.expanded from my "splitstackshape" package:

## Combine your "ethnicity" and "ethnicity_other" column
df$ethnicity <- paste(df$ethnicity, 
                      ifelse(is.na(df$ethnicity_other), "", 
                             as.character(df$ethnicity_other)))
## Drop the original "ethnicity_other" column
df$ethnicity_other <- NULL

## Split up the new "ethnicity" column
library(splitstackshape)
concat.split.expanded(df, "ethnicity", sep=" ", 
                      type="character", fill=0, drop=TRUE)
#   age ethnicity_bemba ethnicity_lozi ethnicity_luvale ethnicity_ngoni
# 1  24               0              0                0               1
# 2  28               1              0                0               0
# 3  44               0              1                1               0
# 4  55               1              0                0               0
# 5  53               1              0                0               0
#   ethnicity_other ethnicity_tonga ethnicity_tongi
# 1               0               0               0
# 2               0               0               0
# 3               0               1               0
# 4               1               1               0
# 5               0               0               1

You can easily set the fill argument to whatever else you want. It defaults to NA, but here I have set it to 0, since I think that is what you are looking for.

Converting "select all that apply" into binary choices
