I have simple data like the MovieLens 1M data file:
item_id title genres
1 1 Toy Story (1995) Animation|Children's|Comedy
2 2 Jumanji (1995) Adventure|Children's|Fantasy
3 3 Grumpier Old Men (1995) Comedy|Romance
4 4 Waiting to Exhale (1995) Comedy|Drama
5 5 Father of the Bride Part II (1995) Comedy
6 6 Heat (1995) Action|Crime|Thriller
My genres column can contain any of 19 values. How should I reshape my data? The genre values are:
genreTbl['title']
title
1 unknown
2 Action
3 Adventure
4 Animation
5 Children's
6 Comedy
7 Crime
8 Documentary
9 Drama
10 Fantasy
11 Film-Noir
12 Horror
13 Musical
14 Mystery
15 Romance
16 Sci-Fi
17 Thriller
18 War
19 Western
I want to change my data to this structure:
item_id movie_title release_date
1 1 Toy Story (1995) <NA>
2 2 GoldenEye (1995) <NA>
3 3 Four Rooms (1995) <NA>
4 4 Get Shorty (1995) <NA>
5 5 Copycat (1995) <NA>
6 6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) <NA>
unknown Action Adventure Animation Children's Comedy Crime Documentary Drama
1 0 0 0 1 1 1 0 0 0
2 0 1 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 1 0 0 0 1 0 0 1
5 0 0 0 0 0 0 1 0 1
6 0 0 0 0 0 0 0 0 1
Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 1 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 1 0 0
6 0 0 0 0 0 0 0 0 0 0
I need every genre as its own column, as shown above. A cell should be 1 if the item's genres value contains that genre, and 0 otherwise.
Answer 0 (score: 5)
Alternatively, you can use the concat.split family of functions, also from the splitstackshape package:
library(splitstackshape)
concat.split.expanded(mydf, split.col = "genres", sep = "|", type = "character",
                      mode = "binary", fixed = TRUE, fill = 0)
## Alternative alias
## Note also `drop = TRUE` to drop the original column
cSplit_e(mydf, split.col = "genres", sep = "|", type = "character",
mode = "binary", fixed = TRUE, fill = 0, drop = TRUE)
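The calls above create columns only for genres that actually occur in the data. If all 19 genre columns from the question are needed even when some genres never appear, one option is a base R approach that checks each row against the full genre list. The following is a minimal sketch and not part of the original answer; the object names movies, all_genres, indicators and result are assumptions made for illustration, and the sample rows are a subset of the question's data.

```r
# Assumed sample data mirroring the question (hypothetical object name `movies`)
movies <- data.frame(
  item_id = 1:3,
  title   = c("Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"),
  genres  = c("Animation|Children's|Comedy",
              "Adventure|Children's|Fantasy",
              "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

# The 19 genre names listed in the question
all_genres <- c("unknown", "Action", "Adventure", "Animation", "Children's",
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")

# Split on the literal "|" (fixed = TRUE keeps it from acting as regex alternation)
genre_list <- strsplit(as.character(movies$genres), "|", fixed = TRUE)

# For each movie, build a 0/1 vector over all_genres; transpose to one row per movie
indicators <- t(vapply(genre_list,
                       function(g) as.integer(all_genres %in% g),
                       integer(length(all_genres))))
colnames(indicators) <- all_genres

# Attach the indicator columns to the identifying columns
result <- cbind(movies[c("item_id", "title")], indicators)
```

Because the columns are driven by all_genres rather than by the data, genres such as Western still get a column of zeros here.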
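Answer 1 (score: 4)
Use a combination of cSplit from splitstackshape and dcast from reshape2 / data.table. With length as the aggregation function, each cell counts the matching rows, which yields 0/1 integer indicators: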
library(splitstackshape)
library(reshape2) # or library(data.table)
dcast(cSplit(mydf, "genres", sep = "|", direction = "long"),
      item_id + title ~ genres,
      fun.aggregate = length)
which gives:
item_id title Action Adventure Animation Children's Comedy Crime Drama Fantasy Romance Thriller
1: 1 ToyStory(1995) 0 0 1 1 1 0 0 0 0 0
2: 2 Jumanji(1995) 0 1 0 1 0 0 0 1 0 0
3: 3 GrumpierOldMen(1995) 0 0 0 0 1 0 0 0 1 0
4: 4 WaitingtoExhale(1995) 0 0 0 0 1 0 1 0 0 0
5: 5 FatheroftheBridePartII(1995) 0 0 0 0 1 0 0 0 0 0
6: 6 Heat(1995) 1 0 0 0 0 1 0 0 0 1
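To see why counting produces 0/1 indicators, it may help to look at the long intermediate form, where each item/genre pair occupies one row. The sketch below builds such a table by hand and counts rows with base R's table. It is only an analogue for illustration, not what dcast does internally; the object names long and counts are assumptions.

```r
# Hand-built long form: one row per (item, genre) pair, as a long split would produce
long <- data.frame(
  item_id = c(1, 1, 1, 2, 2, 2),
  genres  = c("Animation", "Children's", "Comedy",
              "Adventure", "Children's", "Fantasy"),
  stringsAsFactors = FALSE
)

# Count rows per item/genre combination; absent combinations are counted as 0
counts <- table(long$item_id, long$genres)
```

Each pair occurs at most once, so every count is 0 or 1. If a genre were repeated within a single genres string, the count for that cell would exceed 1.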
Data used:
mydf <- structure(list(item_id = 1:6, title = structure(c(5L, 4L, 2L,
6L, 1L, 3L), .Label = c("FatheroftheBridePartII(1995)", "GrumpierOldMen(1995)",
"Heat(1995)", "Jumanji(1995)", "ToyStory(1995)", "WaitingtoExhale(1995)"
), class = "factor"), genres = structure(c(3L, 2L, 6L, 5L, 4L,
1L), .Label = c("Action|Crime|Thriller", "Adventure|Children's|Fantasy",
"Animation|Children's|Comedy", "Comedy", "Comedy|Drama", "Comedy|Romance"
), class = "factor")), .Names = c("item_id", "title", "genres"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6"))