I am analyzing a movie dataset and want to parse the genres and count how often each one occurs.
My data looks like this:
data1,data2,comedy | action | adventure,data4,data5
I want to count the occurrences of each genre. So far I have managed to tabulate the raw genre strings with this code:
genres <- as.data.frame(table(c(movies["genres"])))
# genres look like:
# Var1 Freq
# 1 Action 11
# 2 Action|Adventure 11
# 3 Action|Adventure|Animation|Comedy|Crime|Family|Fantasy 1
# ...
# as you can see, there are combined items which I need to split
# for 'debugging' purposes I managed to get
strsplit(toString(genres$Var1[3]), split = "|", fixed = TRUE)
# which results in the output below:
# [[1]]
# [1] "Action" "Adventure" "Animation" "Comedy" "Crime" "Family" "Fantasy"
# My idea is to gather every parsed item into one object, then treat
# that object "as.data.frame" so I can use 'Freq' from the data.frame
# please take a look at the code below:
genres <- as.data.frame(table(c(movies["genres"])))
list <- c()
i = 1
while(i <= length(genres$Var1)){
parse <- strsplit(toString(genres$Var1[i]), split = "|", fixed = TRUE)
merge(list, parse)
i = i + 1
}
Does anyone have a better idea of how to finish this, or of a simpler way to count the genres? Thanks in advance.
Answer 0 (score: 1)
Is this what you are looking for?
# split each row on "|"
xx = strsplit(as.character(df$Var1), "|", fixed = TRUE)
# based on the length of 'list of genre' in each row, repeat the corresponding 'Freq'
yy = lapply(seq_along(xx), function(i) rep(df$Freq[i], length(xx[[i]])))
df1 = data.frame(genre = unlist(xx), freq = unlist(yy))
library(dplyr)
df1 %>% group_by(genre) %>% summarise(total_freq = sum(freq))
# genre total_freq
#1 Action 23
#2 Adventure 12
#3 Animation 1
#4 Comedy 1
#5 Crime 1
#6 Family 1
#7 Fantasy 1
# where the input data df is
#df
# Var1 Freq
#1: Action 11
#2: Action|Adventure 11
#3: Action|Adventure|Animation|Comedy|Crime|Family|Fantasy 1
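
For what it's worth, the same counts can probably be obtained in base R straight from the raw column, without tabulating first and without dplyr. This is only a minimal sketch, assuming `movies$genres` holds the pipe-separated strings. The sample `movies` data frame below is made up to mirror the three rows shown above, and the `trimws()` call is an assumption added in case the separators are surrounded by spaces, as in the raw sample from the question.

```r
# Hypothetical sample standing in for the asker's `movies` data frame.
movies <- data.frame(
  genres = c(rep("Action", 11),
             rep("Action|Adventure", 11),
             "Action|Adventure|Animation|Comedy|Crime|Family|Fantasy"),
  stringsAsFactors = FALSE
)

# Split every row on the literal "|" and flatten the result into one vector.
all_genres <- unlist(strsplit(as.character(movies$genres), "|", fixed = TRUE))

# Assumption: strip stray whitespace in case the data uses " | " as separator.
all_genres <- trimws(all_genres)

# Tabulate the flattened vector and convert the table to a data frame.
genre_counts <- as.data.frame(table(all_genres), stringsAsFactors = FALSE)
names(genre_counts) <- c("genre", "total_freq")

print(genre_counts)
```

Because each movie row contributes one entry per genre it lists, `table()` on the flattened vector counts every genre once per movie, which should match the weighted sum computed from the pre-tabulated `df` above.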