Slice a data.frame and join it back to the original, unsliced data.frame

Time: 2017-07-06 19:07:28

Tags: r dplyr

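I am slicing a data.frame by removing the rows in the bottom 50% by revenue within each id. Now I want to join the result back to the original data.frame so that I can compare the totals before and after slicing.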

I have a working solution, but I am looking for something more elegant.

require(dplyr)

> #creating my data.frame with revenue for id and subid    
> df <- data.frame(id = gl(n = 2, k= 5, length = 10),
+                   subid = gl(n = 6, k = 2, length = 10),
+                   rev = rnorm(10, 100, 15)) 
> df
   id subid       rev
1   1     1 102.80694
2   1     1  77.88691
3   1     2 122.71019
4   1     2  67.13475
5   1     3  93.21146
6   2     3  91.48368
7   2     4 103.05535
8   2     4  82.27343
9   2     5 106.03651
10  2     5  81.14182
> 
> #keep only subid with 50% highest turnover within each id  
> df_sliced <-  df %>% 
+     arrange(id, desc(rev)) %>%
+     group_by(id) %>% 
+     slice(seq(n()*0.5)) %>%
+     group_by(id) %>% 
+     summarise(rev_sliced = sum(rev))
> 
> df_sliced
Source: local data frame [2 x 2]

      id rev_sliced
  (fctr)      (dbl)
1      1   225.5171
2      2   209.0919
> 
> #now I want to join back and compare my sliced result with result before slice. 
> df_desired <- df %>% 
+   group_by(id) %>% 
+   summarise(rev = sum(rev)) %>% 
+   cbind(df_sliced) #this will obviously also give me two columns with id. Desired result is with only one column for id. 
> 
> df_desired
  id      rev id rev_sliced
1  1 463.7503  1   225.5171
2  2 463.9908  2   209.0919

I have not worked out how to do this with a join, nor how to keep everything in a single chain.
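For reference, a join-based version that stays in one chain might look like the sketch below. It assumes dplyr is installed. The `set.seed(42)` call and the name `df_joined` are choices made for this illustration and are not part of the original post, so the numbers will differ from the output shown above.

```r
library(dplyr)

# Assumption: a fixed seed for reproducibility; the original data was unseeded.
set.seed(42)
df <- data.frame(id = gl(n = 2, k = 5, length = 10),
                 subid = gl(n = 6, k = 2, length = 10),
                 rev = rnorm(10, 100, 15))

df_joined <- df %>%
  group_by(id) %>%
  summarise(rev = sum(rev)) %>%          # total revenue per id, before slicing
  inner_join(
    df %>%
      arrange(id, desc(rev)) %>%
      group_by(id) %>%
      slice(seq(n() * 0.5)) %>%          # keep the top half of rows per id
      summarise(rev_sliced = sum(rev)),  # revenue per id after slicing
    by = "id"                            # joining on id avoids the duplicate column from cbind()
  )

df_joined
```

Because the two summaries are matched on `id` rather than bound side by side, the result has a single `id` column and does not depend on both tables being in the same row order.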

1 Answer:

Answer 0 (score: 1)

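For the sliced sum, you can add up the rev values that lie above the 50% quantile within each id, as shown below. Both totals can then be computed in the same summarise call, so no join is needed: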

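df %>% 
    group_by(id) %>% 
    # rev_sliced must come first: once rev is replaced by its sum,
    # the per-row values are no longer available to later expressions
    summarise(rev_sliced = sum(rev[rev > quantile(rev, 0.5)]), 
              rev = sum(rev))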

# A tibble: 2 x 3
#     id rev_sliced      rev
#  <int>      <dbl>    <dbl>
#1     1   225.5171 463.7502
#2     2   209.0919 463.9908
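The quantile filter and the `slice()` approach from the question agree on this example, but they are not the same operation in general. The following is a minimal check on a seeded copy of the data; the seed and the names `by_slice` and `by_quantile` are assumptions introduced for illustration only.

```r
library(dplyr)

# Assumption: a fixed seed for reproducibility; the original data was unseeded.
set.seed(42)
df <- data.frame(id = gl(n = 2, k = 5, length = 10),
                 subid = gl(n = 6, k = 2, length = 10),
                 rev = rnorm(10, 100, 15))

# Approach from the question: sort, then keep the first half of the rows per id.
by_slice <- df %>%
  arrange(id, desc(rev)) %>%
  group_by(id) %>%
  slice(seq(n() * 0.5)) %>%
  summarise(rev_sliced = sum(rev))

# Approach from the answer: keep rows strictly above the per-id median.
by_quantile <- df %>%
  group_by(id) %>%
  summarise(rev_sliced = sum(rev[rev > quantile(rev, 0.5)]))

# Compare with a numeric tolerance, since the values are summed in a different order.
all.equal(by_slice$rev_sliced, by_quantile$rev_sliced)
```

The two results can diverge when several rows are tied at the median. For revenues 1, 2, 2, 3 in one id, `slice()` keeps two rows (3 and 2), while the strict `>` comparison keeps only the 3.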