Calculating "percent complete" within subgroups using dplyr in R?

Time: 2019-12-18 06:13:47

Tags: r dplyr

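I'm doing some analysis of concert setlists and song lyrics in R. I'd like to compare certain song features across shows based on how far into the show each song was played. Currently, my data is in the following format, where show_num is the ID of the show and song_num indicates whether the song was the first, second, third, etc. song played in that show.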

track_name     show_num    song_num  lyrics
One Song       1           1         line 1
One Song       1           1         line 2
Another Song   1           2         line 1
Another Song   1           2         line 2
Final Song     1           3         line 1
Final Song     1           3         line 2
Final Song     1           3         line 3
One Song       2           1         line 1

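I'd like to create a new variable that calculates how far through the show each song was played. For example, the dataset above would ideally look like this: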

track_name     show_num    song_num  lyrics    perc_complete
One Song       1           1         line 1    .33
One Song       1           1         line 2    .33
Another Song   1           2         line 1    .67
Another Song   1           2         line 2    .67
Final Song     1           3         line 1    1.0
Final Song     1           3         line 2    1.0
Final Song     1           3         line 3    1.0
One Song       2           1         line 1    .20
One Song       2           1         line 1    .20

I tried using a percentile rank approach:

df = tour_w_lyrics%>%
  group_by(show_num) %>% 
  mutate(perc_complete=rank(song_num)/length(song_num))

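but I quickly learned why there is no 100th percentile. How should I create the ideal dataset using dplyr? Or am I going about this analysis the wrong way? Thanks for any help!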

1 Answer:

Answer 0 (score: 1)

We can divide the current song_num by the total number of songs in each show.

library(dplyr)

df %>% group_by(show_num) %>% mutate(perc_complete = song_num/max(song_num))

# track_name  show_num song_num lyrics perc_complete
#  <fct>          <int>    <int> <fct>          <dbl>
#1 OneSong            1        1 line1          0.333
#2 OneSong            1        1 line2          0.333
#3 AnotherSong        1        2 line1          0.667
#4 AnotherSong        1        2 line2          0.667
#5 FinalSong          1        3 line1          1    
#6 FinalSong          1        3 line2          1    
#7 FinalSong          1        3 line3          1    
#.....

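# Alternatively, divide by the number of distinct songs in the show;
# this matches max(song_num) when song_num runs 1, 2, ..., n without gaps
df %>% group_by(show_num) %>% mutate(perc_complete = song_num/n_distinct(song_num))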
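
Both dplyr versions above give the intended result when song_num runs 1, 2, ..., n within a show. If a show's numbering might skip a value, one possible variant is to rank the distinct song numbers first with dplyr's dense_rank(). The sketch below is an assumption about how you might want gaps handled, not part of the original question, and the `gappy` data frame is a made-up example:

```r
library(dplyr)

# Hypothetical show whose song_num skips 2 (illustrative data only)
gappy <- data.frame(
  show_num = c(1L, 1L, 1L, 1L),
  song_num = c(1L, 1L, 3L, 4L)
)

res <- gappy %>%
  group_by(show_num) %>%
  # dense_rank() maps 1, 3, 4 to 1, 2, 3, so the last song always lands on 1
  mutate(perc_complete = dense_rank(song_num) / n_distinct(song_num)) %>%
  ungroup()

round(res$perc_complete, 3)
# 0.333 0.333 0.667 1.000
```

On the same data, song_num/n_distinct(song_num) would exceed 1 for the last song (4/3), while song_num/max(song_num) would stay within 0 to 1 but space the songs by their raw numbers.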

In base R, we can use ave:

df$perc_complete <- with(df, song_num/ave(song_num, show_num, FUN = max))

Data

df <- structure(list(track_name = structure(c(3L, 3L, 1L, 1L, 2L, 2L, 
2L, 3L), .Label = c("AnotherSong", "FinalSong", "OneSong"), class = "factor"), 
show_num = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), song_num = c(1L, 
1L, 2L, 2L, 3L, 3L, 3L, 1L), lyrics = structure(c(1L, 2L, 
1L, 2L, 1L, 2L, 3L, 1L), .Label = c("line1", "line2", "line3"
), class = "factor")), class = "data.frame", row.names = c(NA, -8L))
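
As a side note on why the rank-based attempt in the question falls short: rank() averages tied ranks by default, and length() counts lyric rows rather than songs. A minimal base R sketch, using the song_num values of show 1 from the sample data:

```r
# song_num for show 1: one row per lyric line, so each song repeats
song_num <- c(1, 1, 2, 2, 3, 3, 3)

# The attempt from the question: tied rows share an averaged rank
# (1.5, 1.5, 3.5, 3.5, 6, 6, 6), and the denominator is 7 rows, not 3 songs
attempt <- rank(song_num) / length(song_num)
round(attempt, 3)
# 0.214 0.214 0.500 0.500 0.857 0.857 0.857

# Dividing by the largest song number gives the intended values
fixed <- song_num / max(song_num)
round(fixed, 3)
# 0.333 0.333 0.667 0.667 1.000 1.000 1.000
```

Because the last song has three lyric rows, its averaged rank is 6 rather than 7, so the attempt tops out at 6/7 instead of 1.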