I have a table of movie data, and its last column lists the categories each movie belongs to.
movieId title category
1 Toy Story (1995) Animation|Children|Comedy
2 Jumanji (1995) Adventure|Children|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller
I want to create one column per category, holding 1 if that category appears in the movie's list and 0 otherwise. Something like:
movieId title animation comedy drama
1 xx 1 0 1
2 xy 1 0 0
3 yy 1 1 0
So far, I have only converted the string to a list, using the following:
f<-function(x) {strsplit(x, split='|', fixed=TRUE)}
movies2$m<-lapply(movies2$category, f)
But I don't know how to do the rest.
I was thinking of something like Python dictionaries, but I don't know how to do that in R.
Data
df1 <- read.table(header = TRUE, stringsAsFactors = FALSE,
text = " movieId title category
1 'Toy Story (1995)' Animation|Children|Comedy
2 'Jumanji (1995)' Adventure|Children|Fantasy
3 'Grumpier Old Men (1995)' Comedy|Romance
4 'Waiting to Exhale (1995)' Comedy|Drama
5 'Father of the Bride Part II (1995)' Comedy
6 'Heat (1995)' Action|Crime|Thriller")
Answer 0 (score: 5)
We can use mtabulate from qdapTools after splitting:
library(qdapTools)
cbind(df1[-3],mtabulate(strsplit(df1$category, "[|]")))
# movieId title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
#1 1 Toy Story (1995) 0 0 1 1 1 0 0 0 0 0
#2 2 Jumanji (1995) 0 1 0 1 0 0 0 1 0 0
#3 3 Grumpier Old Men (1995) 0 0 0 0 1 0 0 0 1 0
#4 4 Waiting to Exhale (1995) 0 0 0 0 1 0 1 0 0 0
#5 5 Father of the Bride Part II (1995) 0 0 0 0 1 0 0 0 0 0
#6 6 Heat (1995) 1 0 0 0 0 1 0 0 0 1
Or using base R:
cbind(df1[-3], as.data.frame.matrix(table(stack(setNames(strsplit(df1$category,
"[|]"), df1$movieId))[2:1])))
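Answer 1 (score: 4)
Here is a base R possibility that uses strsplit() to split the column values, then uses grepl() inside vapply() to match them. The trick is to use FUN.VALUE = integer(.) in vapply(), so that the logical grepl() results are coerced to integer.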
## split the 'category' column on '|' (df1 is the data frame from the Data section)
s <- strsplit(df1$category, "|", fixed = TRUE)
## run the unique sorted values through grepl(), getting integer result
newPart <- vapply(sort(unique(unlist(s))), grepl, integer(nrow(df1)), df1$category, fixed = TRUE)
## bind result to other columns
cbind(df1[-3], newPart)
This yields the following data frame.
movieId title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1 1 Toy Story (1995) 0 0 1 1 1 0 0 0 0 0
2 2 Jumanji (1995) 0 1 0 1 0 0 0 1 0 0
3 3 Grumpier Old Men (1995) 0 0 0 0 1 0 0 0 1 0
4 4 Waiting to Exhale (1995) 0 0 0 0 1 0 1 0 0 0
5 5 Father of the Bride Part II (1995) 0 0 0 0 1 0 0 0 0 0
6 6 Heat (1995) 1 0 0 0 0 1 0 0 0 1
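One caveat with the grepl() call: with fixed = TRUE it matches substrings, so a genre whose name is contained in another genre's name would be counted for both. That does not happen in the sample data, but a minimal sketch with hypothetical genres ("Noir" and "Film-Noir" are made up here, as are the placeholder titles) shows the difference, along with an exact-match variant based on %in%.

```r
# Hypothetical data: "Noir" is a substring of "Film-Noir"
df1 <- data.frame(
  movieId  = 1:3,
  title    = c("A", "B", "C"),  # placeholder titles
  category = c("Film-Noir|Crime", "Noir", "Comedy"),
  stringsAsFactors = FALSE
)

s <- strsplit(df1$category, "|", fixed = TRUE)
genres <- sort(unique(unlist(s)))

# Substring matching, as in the answer: "Noir" also matches "Film-Noir"
by_grepl <- vapply(genres, grepl, integer(nrow(df1)), df1$category, fixed = TRUE)

# Exact matching: test membership in each movie's split genre vector;
# the logical result is promoted to integer by FUN.VALUE, as before
by_exact <- vapply(
  genres,
  function(g) vapply(s, function(x) g %in% x, logical(1)),
  integer(nrow(df1))
)

print(by_grepl[, "Noir"])
print(by_exact[, "Noir"])
```

If the genre names are known not to overlap, the shorter grepl() version is fine; the %in% version is just the safer choice when they might.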
Answer 2 (score: 3)
A very similar approach:
library(dplyr)
library(tidyr)
library(reshape2)
library(stringr)
max.categories = max(str_count(df1$category, "\\|")) + 1
df1new = df1 %>% separate(category, into=letters[1:max.categories], sep="\\|") %>%
melt(c("movieId","title")) %>%
filter(!is.na(value)) %>%
dcast(movieId + title ~ value, fun.aggregate=length)
movieId title Action Adventure Animation Children Comedy Crime Drama Fantasy Romance Thriller
1 1 Toy Story (1995) 0 0 1 1 1 0 0 0 0 0
2 2 Jumanji (1995) 0 1 0 1 0 0 0 1 0 0
3 3 Grumpier Old Men (1995) 0 0 0 0 1 0 0 0 1 0
4 4 Waiting to Exhale (1995) 0 0 0 0 1 0 1 0 0 0
5 5 Father of the Bride Part II (1995) 0 0 0 0 1 0 0 0 0 0
6 6 Heat (1995) 1 0 0 0 0 1 0 0 0 1
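max.categories is just a programmatic way to ensure that the into vector is at least as long as the maximum number of categories for any given title. If you already know that this value will never exceed 5, you could instead write, for example, into=letters[1:5].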
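If you would rather not load stringr just for this count, the same number can probably be obtained in base R by splitting and measuring the pieces. A small sketch, using a shortened category vector for illustration (the name n_per_row is made up):

```r
# Shortened category vector, for illustration only
category <- c("Animation|Children|Comedy", "Comedy|Romance", "Comedy")

# Number of genres per row = number of pieces after splitting on a literal "|"
n_per_row <- lengths(strsplit(category, "|", fixed = TRUE))

# Largest number of genres on any row, and placeholder column names for separate()
max.categories <- max(n_per_row)
into <- letters[1:max.categories]

print(max.categories)
print(into)
```

Note that letters only has 26 entries, so this naming scheme assumes no title has more than 26 categories.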