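Hi all,
My data is close to 4 GB, with the columns UserID, MediaID, and Full/Mini. I want to know how many full and mini episodes each user has watched, i.e. one row per user holding that user's count of Full and Mini episodes. Also, please let me know how to do this more efficiently, since the data is huge and slows processing down.
Thanks. Any help would be highly appreciated.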
dat <- data.frame(id = c("a","a","a","b","c"), media_id = c("1a","1b","1c","2b","2c"), Full_mini = c("ful","ful","mini","mini","full"))
Answer 0 (score: 0)
You can do this with table().
subset(as.data.frame(table(dat[-2])),Freq>0)
#   id Full_mini Freq
# 1  a       ful    2
# 6  c      full    1
# 7  a      mini    1
# 8  b      mini    1
The typos ("ful" vs. "full") are yours!
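Because the "ful"/"full" mismatch splits what is presumably one category into two, it may be worth normalizing the column before counting. A minimal sketch, assuming "ful" is simply a misspelling of "full" (that is a guess about the data, not something the question confirms):

```r
# Same sample data as in the question, including the "ful" typo.
dat <- data.frame(
  id        = c("a", "a", "a", "b", "c"),
  media_id  = c("1a", "1b", "1c", "2b", "2c"),
  Full_mini = c("ful", "ful", "mini", "mini", "full")
)

# Assumption: "ful" is a misspelling of "full". as.character() makes this
# work whether the column is a factor or a character vector.
dat$Full_mini <- sub("^ful$", "full", as.character(dat$Full_mini))

# Cross-tabulate users against episode type: one row per user,
# one column per episode type.
counts <- table(id = dat$id, Full_mini = dat$Full_mini)
counts
```

as.data.frame.matrix(counts) turns the result into an ordinary data frame if you need one.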
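If that is still too slow, split the data in two. Conveniently for your dataset, the last column has only two possible values, so you end up with two smaller datasets in which you only need to count a single column, and that should be fast.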
dat_full <- subset(dat,Full_mini == "full" | Full_mini == "ful")
dat_mini <- subset(dat,Full_mini == "mini")
library(magrittr)
res_full <- dat_full$id %>%
table %>%
as.data.frame %>%
subset(Freq>0) %>%
transform(Full_mini = "full") %>%
setNames(c("id","Freq","Full_mini"))
res_mini <- dat_mini$id %>%
table %>%
as.data.frame %>%
subset(Freq>0) %>%
transform(Full_mini = "mini") %>%
setNames(c("id","Freq","Full_mini"))
res <- rbind(res_full,res_mini)
Or side by side:
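# Use a common set of levels so both tables have one row per user,
# even on R >= 4.0, where data.frame() no longer creates factors.
ids <- sort(unique(as.character(dat$id)))
res_full <- factor(dat_full$id, levels = ids) %>%
  table %>%
  as.data.frame
res_mini <- factor(dat_mini$id, levels = ids) %>%
  table %>%
  as.data.frame
res <- setNames(cbind(res_full[1:2],res_mini[2]),c("id","full","mini"))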
#   id full mini
# 1  a    2    1
# 2  b    0    1
# 3  c    1    0
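For data of this size, another option worth trying is the data.table package, which does grouped counts without splitting the data first. A rough sketch, assuming the package is installed and again that "ful" means "full"; the file path mentioned in the comment is hypothetical:

```r
library(data.table)

# Small stand-in for the real file. For a multi-GB CSV,
# fread("path/to/file.csv") (hypothetical path) reads it into a data.table.
dt <- data.table(
  id        = c("a", "a", "a", "b", "c"),
  media_id  = c("1a", "1b", "1c", "2b", "2c"),
  Full_mini = c("ful", "ful", "mini", "mini", "full")
)

# Assumption: "ful" is a typo for "full"; fix it in place, by reference.
dt[Full_mini == "ful", Full_mini := "full"]

# Long format: one row per (user, episode type) with its count in N.
long <- dt[, .N, by = .(id, Full_mini)]

# Wide format: one row per user, one count column per episode type.
# Combinations that never occur are filled with 0.
wide <- dcast(dt, id ~ Full_mini, fun.aggregate = length, value.var = "media_id")
```

The long form corresponds to the rbind() result above, and the wide form to the side-by-side one. Whether this is actually faster on your data is something to measure rather than take on faith.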