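I'm new to R, so I'm still finding my way around data wrangling. I looked for a similar question but couldn't find one.
I want to add an extra column giving each article's percentage share of the total views for each day. Sample dataset: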
views date article
1578 2015-01-01 A
616 2015-01-01 B
575 2015-01-01 C
1744 2015-01-02 A
541 2015-01-02 B
660 2015-01-02 C
2906 2015-01-03 A
629 2015-01-03 B
643 2015-01-03 C
The result I'm expecting:
views percentage date article
1578 56.99 2015-01-01 A
616 22.25 2015-01-01 B
575 20.77 2015-01-01 C
1744 59.22 2015-01-02 A
541 18.37 2015-01-02 B
660 22.41 2015-01-02 C
2906 69.55 2015-01-03 A
629 15.06 2015-01-03 B
643 15.39 2015-01-03 C
I know this can be done by splitting the data frame with subset, but I'm hoping there is a neater way using a library?
Thanks!
Answer 0: (score: 4)
library(dplyr)
df %>% group_by(date) %>% mutate( percentage = views/sum(views))
Source: local data frame [9 x 4]
Groups: date
views date article percentage
1 1578 2015-01-01 A 0.5698808
2 616 2015-01-01 B 0.2224630
3 575 2015-01-01 C 0.2076562
4 1744 2015-01-02 A 0.5921902
5 541 2015-01-02 B 0.1837012
6 660 2015-01-02 C 0.2241087
7 2906 2015-01-03 A 0.6955481
8 629 2015-01-03 B 0.1505505
9 643 2015-01-03 C 0.1539014
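The output above is a fraction between 0 and 1, while the question asks for a percentage with two decimals. A minimal sketch of that variant, assuming the same grouping by date and using `round()` for the two-decimal display (the data frame is rebuilt inline here so the snippet runs on its own):

```r
library(dplyr)

# Sample data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  views   = c(1578, 616, 575, 1744, 541, 660, 2906, 629, 643),
  date    = rep(c("2015-01-01", "2015-01-02", "2015-01-03"), each = 3),
  article = rep(c("A", "B", "C"), times = 3)
)

result <- df %>%
  group_by(date) %>%                                           # one group per day
  mutate(percentage = round(100 * views / sum(views), 2)) %>%  # share of the day's total
  ungroup()                                                    # drop grouping for later steps

print(result)
```

Because each value is rounded separately, the shares for a day may add up to slightly more or less than 100.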
Or, if the same article may appear more than once per day:
df %>% group_by(date) %>% mutate(sum = sum(views)) %>%
group_by(date, article) %>% mutate(percentage = views/sum) %>%
select(-sum)
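Note that the snippet above keeps one row per original record, so a repeated article still gets a separate share for each of its rows. If what you actually want is a single share per article per day, one possible sketch (an assumption about the intent, not necessarily what the answer meant) is to sum the views per article first. The `dup` data frame below is hypothetical, made up only to show a repeated article:

```r
library(dplyr)

# Hypothetical data: article A appears twice on the same day
dup <- data.frame(
  views   = c(100, 50, 50),
  date    = c("2015-01-01", "2015-01-01", "2015-01-01"),
  article = c("A", "A", "B")
)

per_article <- dup %>%
  group_by(date, article) %>%
  summarise(views = sum(views), .groups = "drop") %>%  # collapse duplicates per article/day
  group_by(date) %>%
  mutate(percentage = 100 * views / sum(views)) %>%    # share of the day's total
  ungroup()

print(per_article)
```

The `.groups` argument of `summarise()` requires dplyr 1.0.0 or later.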
Answer 1: (score: 3)
If df is your data.frame, you can do the following:
library(data.table)
setDT(df)[,percentage:=signif(100*views/sum(views),4),by=date][]
# views date article percentage
#1: 1578 2015-01-01 A 56.99
#2: 616 2015-01-01 B 22.25
#3: 575 2015-01-01 C 20.77
#4: 1744 2015-01-02 A 59.22
#5: 541 2015-01-02 B 18.37
#6: 660 2015-01-02 C 22.41
#7: 2906 2015-01-03 A 69.55
#8: 629 2015-01-03 B 15.06
#9: 643 2015-01-03 C 15.39
Or in base R:
df$percentage = signif(100*with(df, ave(views, date, FUN=function(x) x/sum(x))),4)
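A small variation on the base R line, offered as a sketch rather than a replacement: `prop.table()` on a vector returns `x / sum(x)`, so it can stand in for the anonymous function. It also uses `round(, 2)` instead of `signif(, 4)`, since `signif` keeps four significant digits and would show three decimals for any share below 10. The data frame is rebuilt inline so the snippet runs on its own:

```r
# Sample data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  views   = c(1578, 616, 575, 1744, 541, 660, 2906, 629, 643),
  date    = rep(c("2015-01-01", "2015-01-02", "2015-01-03"), each = 3),
  article = rep(c("A", "B", "C"), times = 3)
)

# ave() applies FUN within each date group and returns a vector aligned with df
df$percentage <- round(100 * ave(df$views, df$date, FUN = prop.table), 2)

print(df)
```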
Data:
df = structure(list(views = c(1578L, 616L, 575L, 1744L, 541L, 660L,
2906L, 629L, 643L), date = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("2015-01-01", "2015-01-02", "2015-01-03"
), class = "factor"), article = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
percentage = c(56.99, 22.25, 20.77, 59.22, 18.37, 22.41,
69.55, 15.06, 15.39)), .Names = c("views", "date", "article",
"percentage"), class = "data.frame", row.names = c(NA, -9L))