R: Convert a list column into separate columns, using its values as names (dummy variables)

Time: 2016-06-17 17:52:09

Tags: r dataframe split

I have a table of movie data whose last column lists the categories each movie belongs to.

  movieId                              title                   category
       1                   Toy Story (1995)  Animation|Children|Comedy
       2                     Jumanji (1995) Adventure|Children|Fantasy
       3            Grumpier Old Men (1995)             Comedy|Romance
       4           Waiting to Exhale (1995)               Comedy|Drama
       5 Father of the Bride Part II (1995)                     Comedy
       6                        Heat (1995)      Action|Crime|Thriller

I want to create one column per category, holding 1 if that category appears in the movie's list and 0 otherwise. Something like:

movieId title   animation   comedy  drama
1        xx        1           0      1
2        xy        1           0      0
3        yy        1           1      0

So far I have only converted the strings into a list column, with the following:

f<-function(x) {strsplit(x, split='|', fixed=TRUE)}
movies2$m<-lapply(movies2$category, f)

But I don't know how to do the rest.

I was thinking of something like a Python dictionary, but I don't know how to do that in R.

Data

df1 <- read.table(header = TRUE, stringsAsFactors = FALSE,
                  text = " movieId                              title                   category
                  1                   'Toy Story (1995)'  Animation|Children|Comedy
                  2                     'Jumanji (1995)' Adventure|Children|Fantasy
                  3            'Grumpier Old Men (1995)'             Comedy|Romance
                  4           'Waiting to Exhale (1995)'               Comedy|Drama
                  5 'Father of the Bride Part II (1995)'                     Comedy
                  6                        'Heat (1995)'      Action|Crime|Thriller")

3 answers:

Answer 0 (score: 5)

After splitting the column, we can use mtabulate from the qdapTools package:

library(qdapTools)
cbind(df1[-3],mtabulate(strsplit(df1$category, "[|]")))
# movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
#1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
#2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
#3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
#4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
#5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
#6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1

Or using base R:

cbind(df1[-3], as.data.frame.matrix(table(stack(setNames(strsplit(df1$category,
                           "[|]"), df1$movieId))[2:1])))
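The base R one-liner above packs several steps into one expression. As a rough sketch, it can be unpacked as follows; the intermediate names (parts, named_parts, long, counts, wide) are introduced here for illustration only, and the data frame is rebuilt inline from the question's sample so the snippet runs on its own.

```r
# Rebuild the sample data from the question so the sketch is self-contained.
df1 <- data.frame(
  movieId = 1:6,
  title = c("Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)",
            "Waiting to Exhale (1995)", "Father of the Bride Part II (1995)",
            "Heat (1995)"),
  category = c("Animation|Children|Comedy", "Adventure|Children|Fantasy",
               "Comedy|Romance", "Comedy|Drama", "Comedy",
               "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

# 1. Split each category string on the literal "|" character.
parts <- strsplit(df1$category, "[|]")

# 2. Name each list element by its movieId so the id survives stacking.
named_parts <- setNames(parts, df1$movieId)

# 3. stack() turns the named list into a long data frame:
#    'values' holds the category, 'ind' holds the movieId.
long <- stack(named_parts)

# 4. Reorder the columns to (ind, values) so ids become rows and
#    categories become columns, then cross-tabulate.
counts <- table(long[2:1])

# 5. Convert the contingency table to a plain data frame and bind it to
#    the original columns, dropping 'category' (column 3).
wide <- cbind(df1[-3], as.data.frame.matrix(counts))
```

Note that table() counts occurrences, so a category repeated within one movie's string would give a value greater than 1 rather than a strict 0/1 indicator.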

Answer 1 (score: 4)

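Here is a base R possibility that splits the column values with strsplit() and then matches them using grepl() inside vapply(). The trick is to pass integer(.) as FUN.VALUE in vapply(), so that the logical grepl() results are coerced to integers.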

## split the 'category' column on '|'
s <- strsplit(df1$category, "|", fixed = TRUE)
## run the unique sorted values through grepl(), getting integer result
newPart <- vapply(sort(unique(unlist(s))), grepl, integer(nrow(df1)), df1$category, fixed = TRUE)
## bind result to other columns
cbind(df1[-3], newPart)

This produces the following data frame.

  movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1
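One caveat worth keeping in mind: grepl() with fixed = TRUE matches substrings, so a category whose name is contained in another category's name would be flagged for both. The sample data has no such pair, so the sketch below uses hypothetical categories ("Sci" and "Sci-Fi", not from the question) to show the difference, together with one possible exact-match variant based on %in%. The names lv, by_grepl and by_exact are made up for this illustration.

```r
# Hypothetical data (not from the question): "Sci" is a substring of "Sci-Fi".
category <- c("Sci-Fi|Drama", "Sci|Comedy", "Drama")

s  <- strsplit(category, "|", fixed = TRUE)
lv <- sort(unique(unlist(s)))

# Substring matching, as in the answer above: "Sci" also matches "Sci-Fi".
by_grepl <- vapply(lv, grepl, integer(length(category)), category, fixed = TRUE)

# Exact matching: for each level, check membership in every row's split pieces.
# The inner vapply() returns logicals; the outer FUN.VALUE promotes them to integer.
by_exact <- vapply(
  lv,
  function(l) vapply(s, function(x) l %in% x, logical(1)),
  integer(length(s))
)

by_grepl[1, "Sci"]  # 1: false positive from substring matching
by_exact[1, "Sci"]  # 0: row 1 contains "Sci-Fi", not "Sci"
```

For the movie data in the question both variants give the same result, since none of its category names contains another.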

Answer 2 (score: 3)

A fairly similar approach:

library(dplyr)
library(tidyr)
library(reshape2)
library(stringr)

max.categories = max(str_count(df1$category, "\\|")) + 1

df1new = df1 %>% separate(category, into=letters[1:max.categories], sep="\\|") %>%
  melt(c("movieId","title")) %>%
  filter(!is.na(value)) %>%
  dcast(movieId + title ~ value, fun.aggregate=length)

  movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1

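max.categories is just a programmatic way to make sure the into vector is at least as long as the largest number of categories for any given title. If you already know this value will never exceed 5, you could simply write, for example, into=letters[1:5].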
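The reasoning behind the "+ 1" is that a string with k separators contains k + 1 categories. A small base R sketch, using three category strings taken from the question, cross-checks the separator count against the number of pieces after splitting; the names n_sep, max_categories and max_pieces are illustrative and not part of the answer.

```r
# Three category strings taken from the question's sample data.
category <- c("Animation|Children|Comedy", "Comedy|Romance", "Comedy")

# Number of "|" separators in each string, counted with base R.
n_sep <- lengths(regmatches(category, gregexpr("|", category, fixed = TRUE)))

# A string with k separators holds k + 1 categories.
max_categories <- max(n_sep) + 1

# The same number, obtained by splitting and counting the pieces.
max_pieces <- max(lengths(strsplit(category, "|", fixed = TRUE)))

max_categories == max_pieces  # TRUE
```

The two counts agree here because none of the strings is empty; an empty category string would split into zero pieces, so that case would need separate handling.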