Question

I have a table of movie data, and its last column lists the categories each movie belongs to.

  movieId                              title                   category
       1                   Toy Story (1995)  Animation|Children|Comedy
       2                     Jumanji (1995) Adventure|Children|Fantasy
       3            Grumpier Old Men (1995)             Comedy|Romance
       4           Waiting to Exhale (1995)               Comedy|Drama
       5 Father of the Bride Part II (1995)                     Comedy
       6                        Heat (1995)      Action|Crime|Thriller

I want to create one column per category, containing 1 if the category appears in that movie's list and 0 otherwise. Something like:

movieId title   animation   comedy  drama
1        xx        1           0      1
2        xy        1           0      0
3        yy        1           1      0

So far, I have only converted the strings into lists, with the following:

f<-function(x) {strsplit(x, split='|', fixed=TRUE)}
movies2$m<-lapply(movies2$category, f)

But I don't know how to do the rest.

I was thinking of something like a Python dictionary, but I don't know how to do that in R.

Data

df1 <- read.table(header = TRUE, stringsAsFactors = FALSE,
                  text = " movieId                              title                   category
                  1                   'Toy Story (1995)'  Animation|Children|Comedy
                  2                     'Jumanji (1995)' Adventure|Children|Fantasy
                  3            'Grumpier Old Men (1995)'             Comedy|Romance
                  4           'Waiting to Exhale (1995)'               Comedy|Drama
                  5 'Father of the Bride Part II (1995)'                     Comedy
                  6                        'Heat (1995)'      Action|Crime|Thriller")

Answer 1

We can use mtabulate from qdapTools after splitting:

library(qdapTools)
cbind(df1[-3],mtabulate(strsplit(df1$category, "[|]")))
# movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
#1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
#2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
#3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
#4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
#5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
#6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1

Or using base R:

cbind(df1[-3], as.data.frame.matrix(table(stack(setNames(strsplit(df1$category,
                           "[|]"), df1$movieId))[2:1])))
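The base R one-liner is dense, so here is a step-by-step sketch of what it appears to do. The intermediate names (`parts`, `long`, `counts`, `result`) are illustrative and not part of the answer, and `df1` is rebuilt inline only so the block runs on its own.

```r
# Sample data, same content as df1 in the Data section
df1 <- data.frame(
  movieId  = 1:6,
  title    = c("Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)",
               "Waiting to Exhale (1995)", "Father of the Bride Part II (1995)",
               "Heat (1995)"),
  category = c("Animation|Children|Comedy", "Adventure|Children|Fantasy",
               "Comedy|Romance", "Comedy|Drama", "Comedy",
               "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

# 1. Split each category string into a character vector, named by movieId
parts <- setNames(strsplit(df1$category, "|", fixed = TRUE), df1$movieId)

# 2. Stack into long form: one row per (category, movieId) pair.
#    stack() returns the columns `values` (category) and `ind` (movieId)
long <- stack(parts)

# 3. Cross-tabulate movieId (rows) against category (columns)
counts <- table(long$ind, long$values)

# 4. Turn the table into a data frame and bind it to the other columns
result <- cbind(df1[-3], as.data.frame.matrix(counts))
```

Each cell of `counts` is a count, so it stays 0/1 only as long as no movie lists the same category twice.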

Answer 2

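Here is a base R possibility that splits the column values with strsplit() and then matches them with grepl() inside vapply(). The trick is to pass FUN.VALUE = integer(.) to vapply(), so that the logical grepl() results are promoted to integers.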

## split the 'category' column on '|' (df1 is the data frame from the Data section)
s <- strsplit(df1$category, "|", fixed = TRUE)
## run the unique sorted values through grepl(), getting integer result
newPart <- vapply(sort(unique(unlist(s))), grepl, integer(nrow(df1)), df1$category, fixed = TRUE)
## bind result to other columns
cbind(df1[-3], newPart)

This produces the following data frame.

  movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1
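One caveat worth checking: grepl(..., fixed = TRUE) matches substrings, so a category whose name is contained in another would be flagged for both. The sample data has no such pair, so the sketch below uses a hypothetical two-row data frame (`toy`, with a made-up "True Crime" category) and compares the substring approach with an exact membership test via `%in%`. The names `by_grepl` and `by_exact` are illustrative.

```r
# Hypothetical data: "Crime" is a substring of "True Crime"
toy <- data.frame(movieId  = 1:2,
                  category = c("True Crime|Drama", "Crime"),
                  stringsAsFactors = FALSE)

s    <- strsplit(toy$category, "|", fixed = TRUE)
cats <- sort(unique(unlist(s)))

# Substring matching, as in the answer above:
# movie 1 is flagged for "Crime" because "True Crime" contains it
by_grepl <- vapply(cats, grepl, integer(nrow(toy)), toy$category, fixed = TRUE)

# Exact matching: test whether the category is an element of each split list
by_exact <- vapply(
  cats,
  function(ct) as.integer(vapply(s, function(x) ct %in% x, logical(1))),
  integer(nrow(toy))
)

by_grepl[1, "Crime"]  # 1 (false positive)
by_exact[1, "Crime"]  # 0
```

If the category names never overlap, the two approaches should give the same result and the shorter grepl() version is fine.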

Answer 3

A very similar approach:

library(dplyr)
library(tidyr)
library(reshape2)
library(stringr)

max.categories = max(str_count(df1$category, "\\|")) + 1

df1new = df1 %>% separate(category, into=letters[1:max.categories], sep="\\|") %>%
  melt(c("movieId","title")) %>%
  filter(!is.na(value)) %>%
  dcast(movieId + title ~ value, fun.aggregate=length)

  movieId                              title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1       1                   Toy Story (1995)      0         0         1        1      1     0     0       0       0        0
2       2                     Jumanji (1995)      0         1         0        1      0     0     0       1       0        0
3       3            Grumpier Old Men (1995)      0         0         0        0      1     0     0       0       1        0
4       4           Waiting to Exhale (1995)      0         0         0        0      1     0     1       0       0        0
5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0     0       0       0        0
6       6                        Heat (1995)      1         0         0        0      0     1     0       0       0        1

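max.categories is just a programmatic way to make sure the into vector is at least as long as the maximum number of categories for any given title. If you already know that this value will never be greater than 5, you could simply use, for example, into=letters[1:5]. Movies with fewer categories than that get NA in the leftover columns (separate() warns about the missing pieces), which is why the NA values are filtered out before dcast().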
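As a cross-check on max.categories, the same number can be obtained in base R by counting the pieces after splitting. This is a minimal sketch: the category strings are copied from the sample data, and the names `category`, `n_per_movie`, `n_separators` and `max_categories` are illustrative.

```r
# Category strings copied from the sample data
category <- c("Animation|Children|Comedy", "Adventure|Children|Fantasy",
              "Comedy|Romance", "Comedy|Drama", "Comedy",
              "Action|Crime|Thriller")

# Number of categories per movie = number of pieces after splitting on "|"
n_per_movie <- lengths(strsplit(category, "|", fixed = TRUE))

# Same idea as counting separators and adding one
n_separators <- lengths(regmatches(category, gregexpr("|", category, fixed = TRUE)))
stopifnot(identical(n_per_movie, n_separators + 1L))

# Longest list of categories for a single movie
max_categories <- max(n_per_movie)
```

This avoids the stringr dependency if the count is all that is needed.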

R - Convert a list column into separate columns, using its values as names (dummy variables)