R & dplyr: How do I include more columns from the original data frame/data table after summarise()?

Time: 2016-04-22 02:43:20

Tags: r dplyr

In other words, how do I summarise one column (e.g. income) while keeping another column (e.g. location)?

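This MWE illustrates my problem. After running summarise(), how do I add the location column back in? Is there a solution that involves "going up a level" before summarise(), so that I can keep the original columns?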

library(data.table)
library(dplyr)

test <- as.data.table(data.frame(event_id = c("A","B","A","A","B"),
                                 income = c(1,2,3,4,5),
                                 location = c("PlaceX","PlaceY","PlaceX","PlaceX","PlaceY")))

test

   event_id income location
1:        A      1   PlaceX
2:        B      2   PlaceY
3:        A      3   PlaceX
4:        A      4   PlaceX
5:        B      5   PlaceY

test %>%
  group_by(event_id) %>%
  summarise(mean_inc = mean(income))

Source: local data table [2 x 2]

  event_id mean_inc
    (fctr)    (dbl)
1        A 2.666667
2        B 3.500000

The following does not work:

test %>%
  group_by(event_id) %>%
  summarise(mean_inc = mean(income),
  location = location)

Source: local data table [5 x 3]

  event_id mean_inc location
    (fctr)    (dbl)   (fctr)
1        A 2.666667   PlaceX
2        A 2.666667   PlaceX
3        A 2.666667   PlaceX
4        B 3.500000   PlaceY
5        B 3.500000   PlaceY

My desired output is:

Source: local data table [2 x 3]

  event_id location mean_inc
    (fctr)   (fctr)    (dbl)
1        A   PlaceX 2.666667
2        B   PlaceY 3.500000

2 answers:

Answer 0: (score: 1)

I hope I understand what you want. Do an inner_join to recover the missing columns (assuming they match the group_by argument 1-to-1):

 newtest <- test %>%
   group_by(event_id) %>%
   summarise(mean_inc = mean(income)) %>% inner_join(test[-(1:2)])
#Joining by: "event_id"
 newtest
#-----------------
Source: local data table [3 x 4]

  event_id mean_inc income location
    (fctr)    (dbl)  (dbl)   (fctr)
1        A 2.666667      3   PlaceX
2        A 2.666667      4   PlaceX
3        B 3.500000      5   PlaceY
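Note that on a data.table, test[-(1:2)] drops the first two rows rather than the first two columns, which is why the result above still carries income and repeats event_id A. A minimal sketch of a join that returns one row per event_id follows. It assumes a plain data.frame instead of a data.table, a reasonably recent dplyr, and that each event_id maps to exactly one location; the name lookup is introduced here purely for illustration.

```r
library(dplyr)

# Same toy data as in the question, as a plain data.frame (assumption:
# data.table is not needed for the join itself).
test <- data.frame(
  event_id = c("A", "B", "A", "A", "B"),
  income   = c(1, 2, 3, 4, 5),
  location = c("PlaceX", "PlaceY", "PlaceX", "PlaceX", "PlaceY")
)

# Hypothetical helper table: one row per event_id with its location.
# This only gives one row per group if event_id and location match 1-to-1.
lookup <- test %>%
  select(event_id, location) %>%
  distinct()

# Summarise first, then join the per-group location back on.
newtest <- test %>%
  group_by(event_id) %>%
  summarise(mean_inc = mean(income)) %>%
  inner_join(lookup, by = "event_id")

newtest
```

If an event_id had more than one location, the join would return one row per (event_id, location) pair, each with the same mean_inc.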

Alternatively, you may want to group on both event_id and location:

  test %>%
   group_by(event_id,location) %>%
   summarise(mean_inc = mean(income))
#---------
#Source: local data table [2 x 3]
#Groups: event_id

  event_id location mean_inc
    (fctr)   (fctr)    (dbl)
1        A   PlaceX 2.666667
2        B   PlaceY 3.500000

Answer 1: (score: 0)

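One option might be to use mutate and then pull out one row per group with distinct.

How useful this is depends on the actual use case: it seems most useful when the new variable has the same name as the original variable it summarises. Otherwise you end up with the original, unsummarised variable in the final dataset.

distinct works here because the object is still grouped.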

test %>% 
    group_by(event_id) %>%
    mutate(income = mean(income)) %>%
    distinct()

Source: local data table [2 x 3]

  event_id   income location
    (fctr)    (dbl)   (fctr)
1        A 2.666667   PlaceX
2        B 3.500000   PlaceY

In dplyr_0.4.3.9000, you need .keep_all = TRUE in distinct.
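As a rough sketch of the .keep_all variant, the example below writes the mean to a new column (mean_inc, following the question's naming) and then drops the unsummarised income column explicitly. It assumes a plain data.frame, a dplyr version in which distinct() accepts .keep_all, and that location is constant within each event_id, since distinct() keeps only the first row of each group.

```r
library(dplyr)

# Same toy data as in the question, as a plain data.frame (assumption).
test <- data.frame(
  event_id = c("A", "B", "A", "A", "B"),
  income   = c(1, 2, 3, 4, 5),
  location = c("PlaceX", "PlaceY", "PlaceX", "PlaceX", "PlaceY")
)

result <- test %>%
  group_by(event_id) %>%
  mutate(mean_inc = mean(income)) %>%       # group mean repeated on every row
  ungroup() %>%
  distinct(event_id, .keep_all = TRUE) %>%  # first row per event_id, all columns kept
  select(event_id, location, mean_inc)      # drop the unsummarised income column

result
```

Without .keep_all = TRUE, distinct(event_id) would return only the event_id column.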