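For each movie (column), I need to count the rows where the rating is 4 or higher, then divide that count by the total number of ratings. How can I do this? See the image below for a rough idea.
The final result should be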
0.7000000, 'The Shawshank Redemption'
0.5333333, 'Star Wars IV - A New Hope'
0.5000000, 'Gladiator'
0.4444444, 'Blade Runner'
0.4375000, 'The Silence of the Lambs'
Answer 0 (score: 1)
The data is not in the usual tidy format. Here df stands in for your data frame, filled with some made-up values.
library(dplyr)
# tibble() replaces data_frame(), which is deprecated in current dplyr/tibble
df <- tibble(user = letters[1:10],
             m1 = c(1,5,NA,NA,4,2,NA,4,5,4),
             m2 = c(5,3,NA,3,3,4,NA,NA,1,2),
             m3 = c(2,NA,NA,NA,4,4,3,NA,NA,NA))
df
# A tibble: 10 × 4
# user m1 m2 m3
# <chr> <dbl> <dbl> <dbl>
#1 a 1 5 2
#2 b 5 3 NA
#3 c NA NA NA
#4 d NA 3 NA
#5 e 4 3 4
#6 f 2 4 4
#7 g NA NA 3
#8 h 4 NA NA
#9 i 5 1 NA
#10 j 4 2 NA
In this case, we convert it to key:value pairs, i.e. movie:rating.
library(tidyr)
df <- gather(df, movie, rating, -user)
df
# A tibble: 30 × 3
# user movie rating
# <chr> <chr> <dbl>
#1 a m1 1
#2 b m1 5
#3 c m1 NA
#4 d m1 NA
#5 e m1 4
#6 f m1 2
#7 g m1 NA
#8 h m1 4
#9 i m1 5
#10 j m1 4
# ... with 20 more rows
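In current versions of tidyr, gather() is superseded by pivot_longer(). The following is a minimal sketch of the same reshaping step, assuming tidyr 1.0.0 or later; the small df_wide data frame and the name df_long are made up for illustration. Note that pivot_longer() orders the rows user by user rather than column by column, so the printed result will not match the gather() output above row for row.

```r
library(tidyr)

# Hypothetical wide data: one row per user, one column per movie
df_wide <- data.frame(user = c("a", "b", "c"),
                      m1 = c(1, 5, NA),
                      m2 = c(5, 3, NA))

# Every column except user becomes a movie/rating pair
df_long <- pivot_longer(df_wide,
                        cols = -user,
                        names_to = "movie",
                        values_to = "rating")
df_long
```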
Now it is easy to summarise.
df %>% group_by(movie) %>% summarise(countp = mean(rating >= 4, na.rm = TRUE))
# A tibble: 3 × 2
# movie countp
# <chr> <dbl>
#1 m1 0.7142857
#2 m2 0.2857143
#3 m3 0.5000000
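The question asks for a count divided by a total, so it may help to see those two numbers separately. This is a sketch only: the long data frame and the column names n_high, n_rated and share are invented for the example, and NA ratings are assumed to be excluded from the total.

```r
library(dplyr)

# Hypothetical long-format ratings, one row per user/movie pair
long <- data.frame(movie  = rep(c("m1", "m2"), each = 4),
                   rating = c(5, 4, 2, NA, 1, 4, NA, NA))

result <- long %>%
  group_by(movie) %>%
  summarise(n_high  = sum(rating >= 4, na.rm = TRUE),  # ratings of 4 or higher
            n_rated = sum(!is.na(rating)),             # ratings actually given
            share   = n_high / n_rated)
result
```

Here share equals mean(rating >= 4, na.rm = TRUE); keeping the counts alongside makes it easier to spot movies with very few ratings.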
Answer 1 (score: 0)
You can use colMeans to compute the proportions and stack to reshape the result into long format:
Example data frame:
df <- data.frame(user = c("A", "B", "C", "D"),
                 movieA = c(4,2,NA,5),
                 movieB = c(1,1,NA,4))
stack(colMeans(df[-1] >= 4, na.rm = TRUE))
# values ind
#1 0.6666667 movieA
#2 0.3333333 movieB
To see how this works:
df[-1] >= 4 # returns a boolean matrix where ratings >= 4 gives TRUE
# movieA movieB
#[1,] TRUE FALSE
#[2,] FALSE FALSE
#[3,] NA NA
#[4,] TRUE TRUE
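The mean of a logical vector is the proportion of TRUE values (with NA removed), so calling colMeans on this matrix gives the proportion you need for every column.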
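The point about logical vectors can be checked on a single column. This small sketch uses a made-up vector x; the names is_high, with_na and without_na are illustrative only.

```r
# Hypothetical ratings for one movie, with one missing value
x <- c(4, 2, NA, 5)

is_high <- x >= 4                           # TRUE FALSE NA TRUE
with_na    <- mean(is_high)                 # NA, because the missing value propagates
without_na <- mean(is_high, na.rm = TRUE)   # 2 of the 3 known ratings are >= 4
without_na
```

TRUE counts as 1 and FALSE as 0 when averaged, which is why the mean is a proportion.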
Answer 2 (score: 0)
ratings<-data.frame(User=c("John","Maria","Anton","Roger","Martina","Ana","Sergi","Marc","Jim","Chris")
,Star.Wars.IV...A.New.Hope=c(1,5,NA,NA,4,2,NA,4,5,4)
,Star.Wars.VI...Return.of.the.Jedi=c(5,3,NA,3,3,4,NA,NA,1,2)
,Forrest.Gump=c(2,NA,NA,NA,4,4,3,NA,NA,2)
)
ratings
User Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi Forrest.Gump
1 John 1 5 2
2 Maria 5 3 NA
3 Anton NA NA NA
4 Roger NA 3 NA
5 Martina 4 3 4
6 Ana 2 4 4
7 Sergi NA NA 3
8 Marc 4 NA NA
9 Jim 5 1 NA
10 Chris 4 2 2
If you want NA values to count towards the total number of ratings:
colSums(ratings[,-1]>=4,na.rm=TRUE)/nrow(ratings)
Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi Forrest.Gump
0.5 0.2 0.2
If you want to exclude NA values from the total number of ratings:
colMeans(ratings[,-1]>=4,na.rm=TRUE)
Star.Wars.IV...A.New.Hope Star.Wars.VI...Return.of.the.Jedi Forrest.Gump
0.7142857143 0.2857142857 0.4000000000
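The expected result in the question lists movies from highest to lowest proportion. One way to get that ordering in base R is sketched below; the ratings_small data frame and the names share and ranked are assumptions for the example, and NA ratings are excluded from the total.

```r
# Hypothetical ratings: first column is the user, the rest are movies
ratings_small <- data.frame(User   = c("u1", "u2", "u3", "u4"),
                            MovieA = c(4, 2, NA, 5),
                            MovieB = c(1, 1, NA, 4),
                            MovieC = c(5, 5, 4, NA))

# Proportion of ratings >= 4 per movie, ignoring NA
share <- colMeans(ratings_small[, -1] >= 4, na.rm = TRUE)

# Highest proportion first
ranked <- sort(share, decreasing = TRUE)

# Two-column layout similar to the one shown in the question
data.frame(share = unname(ranked), movie = names(ranked))
```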