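I am slicing a data.frame by removing the rows with the lowest 50% of revenue, and now I want to join the result back to the original data.frame so that I can compare the totals before and after slicing.
I have a solution, but I am looking for something more elegant.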
require(dplyr)
> #creating my data.frame with revenue for id and subid
> df <- data.frame(id = gl(n = 2, k= 5, length = 10),
+ subid = gl(n = 6, k = 2, length = 10),
+ rev = rnorm(10, 100, 15))
> df
id subid rev
1 1 1 102.80694
2 1 1 77.88691
3 1 2 122.71019
4 1 2 67.13475
5 1 3 93.21146
6 2 3 91.48368
7 2 4 103.05535
8 2 4 82.27343
9 2 5 106.03651
10 2 5 81.14182
>
> #keep only subid with 50% highest turnover within each id
> df_sliced <- df %>%
+ arrange(id, desc(rev)) %>%
+ group_by(id) %>%
+ slice(seq(n()*0.5)) %>%
+ group_by(id) %>%
+ summarise(rev_sliced = sum(rev))
>
> df_sliced
Source: local data frame [2 x 2]
id rev_sliced
(fctr) (dbl)
1 1 225.5171
2 2 209.0919
>
> #now I want to join back and compare my sliced result with result before slice.
> df_desired <- df %>%
+ group_by(id) %>%
+ summarise(rev = sum(rev)) %>%
+ cbind(df_sliced) #this will obviously also give me two columns with id. Desired result is with only one column for id.
>
> df_desired
id rev id rev_sliced
1 1 463.7503 1 225.5171
2 2 463.9908 2 209.0919
I have not worked out how to do this with a join, nor how to keep everything in a single chain.
Answer 0 (score: 1)
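For the sliced sum, you can compute the sum of the rev values above the 50% quantile, as shown below; both totals can then be computed in the same summarise expression, with no join needed: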
df %>%
group_by(id) %>%
summarise(rev_sliced = sum(rev[rev > quantile(rev, 0.5)]),
rev = sum(rev))
# A tibble: 2 x 3
# id rev_sliced rev
# <int> <dbl> <dbl>
#1 1 225.5171 463.7502
#2 2 209.0919 463.9908
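If you would rather keep the slice-based logic from the question and still end up with a single id column, one option is to replace cbind with a keyed join. The following is a minimal sketch, not a definitive solution: the set.seed value is an arbitrary assumption added only for reproducibility, and seq_len(floor(...)) is swapped in for seq(...) so that a group with a single row yields an empty slice rather than an error-prone sequence.

```r
library(dplyr)

set.seed(1)  # arbitrary seed (assumption), only so the example is reproducible

# Same structure as the data.frame in the question
df <- data.frame(id = gl(n = 2, k = 5, length = 10),
                 subid = gl(n = 6, k = 2, length = 10),
                 rev = rnorm(10, 100, 15))

# Top half of the rows by revenue within each id, then summed per id
df_sliced <- df %>%
  arrange(id, desc(rev)) %>%
  group_by(id) %>%
  slice(seq_len(floor(n() * 0.5))) %>%
  summarise(rev_sliced = sum(rev))

# Total revenue per id, joined to the sliced totals on the shared key,
# so id appears only once in the result
df_desired <- df %>%
  group_by(id) %>%
  summarise(rev = sum(rev)) %>%
  inner_join(df_sliced, by = "id")

df_desired
```

Note that the quantile filter and the slice are not strictly equivalent: they agree when all rev values within an id are distinct, but can differ when several values tie at the median, because rev > quantile(rev, 0.5) drops every tied value while slice keeps a fixed number of rows.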