Question

I have simple data in the same form as the MovieLens 1M data file:

  item_id                              title                       genres
1       1                   Toy Story (1995)  Animation|Children's|Comedy
2       2                     Jumanji (1995) Adventure|Children's|Fantasy
3       3            Grumpier Old Men (1995)               Comedy|Romance
4       4           Waiting to Exhale (1995)                 Comedy|Drama
5       5 Father of the Bride Part II (1995)                       Comedy
6       6                        Heat (1995)        Action|Crime|Thriller

My genres column draws on 19 possible values. How should I transform my data so that each genre gets its own indicator column?

Genre table

genreTbl['title']
         title
1      unknown
2       Action
3    Adventure
4    Animation
5   Children's
6       Comedy
7        Crime
8  Documentary
9        Drama
10     Fantasy
11   Film-Noir
12      Horror
13     Musical
14     Mystery
15     Romance
16      Sci-Fi
17    Thriller
18         War
19     Western

I want to change my data into this structure:

  item_id                                          movie_title release_date
1       1                                     Toy Story (1995)         <NA>
2       2                                     GoldenEye (1995)         <NA>
3       3                                    Four Rooms (1995)         <NA>
4       4                                    Get Shorty (1995)         <NA>
5       5                                       Copycat (1995)         <NA>
6       6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)         <NA>
  unknown Action Adventure Animation Children's Comedy Crime Documentary Drama
1       0      0         0         1          1      1     0           0     0
2       0      1         1         0          0      0     0           0     0
3       0      0         0         0          0      0     0           0     0
4       0      1         0         0          0      1     0           0     1
5       0      0         0         0          0      0     1           0     1
6       0      0         0         0          0      0     0           0     1
  Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1       0         0      0       0       0       0      0        0   0       0
2       0         0      0       0       0       0      0        1   0       0
3       0         0      0       0       0       0      0        1   0       0
4       0         0      0       0       0       0      0        0   0       0
5       0         0      0       0       0       0      0        1   0       0
6       0         0      0       0       0       0      0        0   0       0

I need every genre as its own column, as shown above: if an item's genres value contains a given genre, that column should be 1, otherwise 0.

Answer 1

You can also use the concat.split.expanded function from the splitstackshape package:

library(splitstackshape)
concat.split.expanded(mydf, split.col = "genres", sep = "|", type = "character",
                      mode = "binary", fixed = TRUE, fill = 0)

## Alternative alias
## Note also `drop = TRUE` to drop the original column
cSplit_e(mydf, split.col = "genres", sep = "|", type = "character", 
         mode = "binary", fixed = TRUE, fill = 0, drop = TRUE)
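If installing a package is not an option, the same kind of binary expansion can be sketched in base R. This is only an illustration of the idea, not how splitstackshape implements it; the three-row mydf below is a shortened stand-in for the question's data, and the names genre_list, genre_names, indicators and result are made up for the example.

```r
# Shortened stand-in for the question's data.
mydf <- data.frame(
  item_id = 1:3,
  title   = c("Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"),
  genres  = c("Animation|Children's|Comedy",
              "Adventure|Children's|Fantasy",
              "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

# Split each genres string on a literal "|".
# fixed = TRUE matters: as a regex, "|" would mean alternation.
# as.character() keeps this working if genres is stored as a factor.
genre_list <- strsplit(as.character(mydf$genres), "|", fixed = TRUE)

# One column per genre that actually occurs in the data.
genre_names <- sort(unique(unlist(genre_list)))

# For each genre, flag the rows whose split list contains it (1) or not (0).
indicators <- sapply(genre_names, function(g) {
  vapply(genre_list, function(x) as.integer(g %in% x), integer(1))
})

# Drop the original genres column and attach the indicator columns.
result <- cbind(mydf[c("item_id", "title")], indicators)
print(result)
```

Like the calls above with drop = TRUE, this leaves out the original genres column and only creates columns for genres that appear in the data.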

Answer 2

Use a combination of cSplit from splitstackshape and dcast from reshape2 / data.table. Using length as the aggregation function produces 0/1 integer indicator values:

library(splitstackshape)
library(reshape2)   # or library(data.table)
dcast(cSplit(mydf, "genres", sep = "|", direction = "long"),
      item_id + title ~ genres, 
      fun.aggregate = length)

which gives:

   item_id                        title Action Adventure Animation Children's Comedy Crime Drama Fantasy Romance Thriller
1:       1               ToyStory(1995)      0         0         1          1      1     0     0       0       0        0
2:       2                Jumanji(1995)      0         1         0          1      0     0     0       1       0        0
3:       3         GrumpierOldMen(1995)      0         0         0          0      1     0     0       0       1        0
4:       4        WaitingtoExhale(1995)      0         0         0          0      1     0     1       0       0        0
5:       5 FatheroftheBridePartII(1995)      0         0         0          0      1     0     0       0       0        0
6:       6                   Heat(1995)      1         0         0          0      0     1     0       0       0        1
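The reason length yields 0/1 values is that, in the long form, each item/genre pair appears at most once, so counting rows per cell gives 1 where the pair exists and 0 where it does not. A minimal base R sketch of that two-step idea follows; table() stands in for dcast with length here, and the small mydf plus the names genre_list, long and counts are assumptions made for the example.

```r
# Small stand-in data set (not the full sample from the question).
mydf <- data.frame(
  item_id = 1:3,
  genres  = c("Comedy|Romance", "Comedy", "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

genre_list <- strsplit(mydf$genres, "|", fixed = TRUE)

# Long form: one row per (item_id, genre) pair.
long <- data.frame(
  item_id = rep(mydf$item_id, lengths(genre_list)),
  genre   = unlist(genre_list),
  stringsAsFactors = FALSE
)
print(long)

# Count the rows in each (item_id, genre) cell; empty cells count as 0.
counts <- table(long$item_id, long$genre)
print(counts)
```

If the same genre could be listed twice for one item, the counts would exceed 1, so this relies on the genres within a row being unique.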

Data used:

mydf <- structure(list(item_id = 1:6, title = structure(c(5L, 4L, 2L, 
6L, 1L, 3L), .Label = c("FatheroftheBridePartII(1995)", "GrumpierOldMen(1995)", 
"Heat(1995)", "Jumanji(1995)", "ToyStory(1995)", "WaitingtoExhale(1995)"
), class = "factor"), genres = structure(c(3L, 2L, 6L, 5L, 4L, 
1L), .Label = c("Action|Crime|Thriller", "Adventure|Children's|Fantasy", 
"Animation|Children's|Comedy", "Comedy", "Comedy|Drama", "Comedy|Romance"
), class = "factor")), .Names = c("item_id", "title", "genres"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5", 
"6"))
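The result shown in this answer only has columns for the ten genres present in the sample, whereas the question asks for all 19. One possible way to pad the missing columns afterwards is sketched below in base R. The all_genres vector is copied from the question's genre table; the two-row wide data frame is a hypothetical stand-in for whatever wide result you already have.

```r
# The 19 genres from the question's genre table, in that order.
all_genres <- c("unknown", "Action", "Adventure", "Animation", "Children's",
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")

# Hypothetical wide result that only covers the genres seen in the data.
wide <- data.frame(
  item_id   = 1:2,
  title     = c("Toy Story (1995)", "Heat (1995)"),
  Animation = c(1L, 0L),
  Action    = c(0L, 1L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Add an all-zero column for every genre that is not there yet.
missing_genres <- setdiff(all_genres, names(wide))
wide[missing_genres] <- 0L

# Reorder so the genre columns follow the genre table.
wide <- wide[c("item_id", "title", all_genres)]
print(wide)
```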
