Question

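I have a data frame with a categorical variable that holds lists of strings of variable length (this matters, because otherwise this question would duplicate existing ones), for example: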

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

The desired form is a dummy variable for each unique string seen anywhere in df$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

This naive approach works:

uniqueStrings <- unique(unlist(df$y))
n <- ncol(df)
for (i in 1:length(uniqueStrings)) {
  df[,  n + i] <- sapply(df$y, function(x) ifelse(uniqueStrings[i] %in% x, 1, 0))
  colnames(df)[n + i] <- uniqueStrings[i]
}

However, for a large data frame it is ugly, lazy, and slow.

Any suggestions? Something fancy from the tidyverse?

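UPDATE: I got three different approaches. I tested them with system.time on my laptop (Windows 7, 32GB RAM) on a real dataset of 1M rows, each row containing a list of 1 to 4 strings (out of ~350 unique string values), about 200MB on disk in total. So the expected result is a data frame with dimensions 1M x 350. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long that I had to restart R. The qdapTools (@akrun) approach, however, worked very well.

So that is the approach I accepted.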

Answer 1

We can use mtabulate from qdapTools:

library(qdapTools)
cbind(df[1], mtabulate(df$y))
#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

Answer 2

Another idea:

library(dplyr)
library(tidyr)

df %>% 
 unnest(y) %>% 
 mutate(new = 1) %>% 
 spread(y, new, fill = 0) 

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
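
In more recent tidyr releases, spread() is superseded by pivot_wider(). Here is a sketch of the same reshaping with it, assuming tidyr 1.0.0 or later and that no string repeats within a single row (repeats would give several values per cell). The names res and new are only illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

res <- df %>%
  unnest(y) %>%                           # one row per (x, string) pair
  mutate(new = 1) %>%                     # indicator value to spread out
  pivot_wider(names_from = y,             # one column per unique string
              values_from = new,
              values_fill = list(new = 0)) # absent combinations become 0

res
```

The result is a tibble with columns in order of first appearance of each string; wrap it in as.data.frame() if a plain data frame is needed.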

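For the case you mentioned in the comments (where x is not unique, as in the variations shown below), we can use dcast from reshape2, since it is more flexible than spread: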

df2 <- df %>% 
        unnest(y) %>% 
        group_by(x) %>% 
        filter(!duplicated(y)) %>% 
        ungroup()

reshape2::dcast(df2, x ~ y, value.var = 'y', length)

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

#or with df$x <- c(1, 1, 2, 2, 3)

#  x A B C D E
#1 1 1 1 0 0 0
#2 2 0 1 1 1 0
#3 3 0 0 0 0 1

#or with df$x <- rep(1,5)

#  x A B C D E
#1 1 1 1 1 1 1

Answer 3

This does not involve any external packages:

# thanks to Sotos for suggesting to use `unique(unlist(df$y))` instead of `LETTERS[1:5]`
sapply(unique(unlist(df$y)), function(j) as.numeric(grepl(j, df$y)))
#     A B C D E
#[1,] 1 0 0 0 0
#[2,] 1 1 0 0 0
#[3,] 0 0 1 0 0
#[4,] 0 1 1 1 0
#[5,] 0 0 0 0 1
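
One caveat: grepl() coerces each list element to a single string and does pattern matching on it, so a value such as "A" would also match a row containing "AB", and strings with regex metacharacters may misbehave. Here is a sketch of a base R variant that uses exact matching with %in% instead. The names lev, dummies and res are introduced only for illustration.

```r
# Example data from the question
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

# All distinct strings, in order of first appearance
lev <- unique(unlist(df$y))

# For each row, flag which of the distinct strings occur in it (exact match);
# vapply returns one column per row, so transpose to get one row per row of df
dummies <- t(vapply(df$y, function(v) as.integer(lev %in% v),
                    integer(length(lev))))
colnames(dummies) <- lev

# Attach the indicator columns to the remaining column(s)
res <- cbind(df["x"], dummies)
res
```

This keeps the result as integers and leaves the original x column in place, which the grepl() version drops.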

R: Create dummy variables from a categorical variable of lists
