Compare aggregate values across groups

Time: 2017-03-19 02:10:28

Tags: r dplyr

Given the df below, I need to:

  1. compute the median of the variable metric for each of the teams tm1, tm2 and tm3, per locid, day, hour combination
  2. then keep only those locid, day, hour combinations whose metric median is the same across teams tm1, tm2 and tm3.

    set.seed(100)
    df <- data.frame(
        locid = sample(c(1111,1122,1133), 20, replace=TRUE),
        day = sample(c(1:3), 20, replace=TRUE),
        hour = sample(c(1:4), 20, replace=TRUE),
        team = sample(c("tm1", "tm2", "tm3"), 20, replace=TRUE),
        metric = sample(1:5, 20, replace=TRUE )
    )
    

My attempt:

    library(dplyr)

    df_medians <- df %>%
        # group by the four columns; adding them with + does not create groups
        group_by(locid, day, hour, team) %>%
        summarise(metric_median = median(metric))

This gives the median per team for each locid, day, hour combination. I now need to find the locid, day, hour combinations that have the same median value across teams tm1, tm2 and tm3.

    df_medians %>% group_by(locid, day, hour, team) %>% summarise(??what here??)

I was trying with dplyr, but a base R solution is fine too.

As a simpler example, consider the data below, which has measurements from two different locations for two teams.

+-------+------+-------+-------+---------+
| locid |  day |  hour |  team |  metric |
+-------+------+-------+-------+---------+
|  1111 |    1 |     1 |  tm1  |       3 |
|  1111 |    1 |     1 |  tm1  |       2 |
|  1111 |    1 |     1 |  tm1  |       1 |
|  1111 |    1 |     1 |  tm2  |       1 |
|  1111 |    1 |     1 |  tm2  |       2 |
|  1111 |    1 |     1 |  tm2  |       3 |
|  1122 |    1 |     1 |  tm1  |       3 |
|  1122 |    1 |     1 |  tm1  |       2 |
|  1122 |    1 |     1 |  tm1  |       1 |
|  1122 |    1 |     1 |  tm2  |       1 |
|  1122 |    1 |     1 |  tm2  |       2 |
|  1122 |    1 |     1 |  tm2  |       1 |
+-------+------+-------+-------+---------+

Step 1 - compute the median by group (locid, day, hour, team)

+-------+------+-------+-------+-------------+
| locid |  day |  hour |  team |  metric_med |
+-------+------+-------+-------+-------------+
|  1111 |    1 |     1 |  tm1  |       2     |
|  1111 |    1 |     1 |  tm2  |       2     |
|  1122 |    1 |     1 |  tm1  |       2     |
|  1122 |    1 |     1 |  tm2  |       1     |
+-------+------+-------+-------+-------------+

Step 2 - compare the medians across teams within each group (locid, day, hour). Only (1111, 1, 1) has the same metric_med across the teams tm1 and tm2.

+-------+------+-------+-------------+
| locid |  day |  hour |  metric_med |
+-------+------+-------+-------------+
|  1111 |    1 |     1 |       2     |
+-------+------+-------+-------------+
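
For anyone who wants to reproduce this, the simpler example can be rebuilt as a data frame and the Step 1 table checked against it. This is only a minimal sketch: the names `df_small` and `step1` are hypothetical (they are not part of my real data), and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical name: df_small holds the 12 rows of the simpler example
df_small <- data.frame(
    locid  = rep(c(1111, 1122), each = 6),
    day    = 1,
    hour   = 1,
    team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
    metric = c(3, 2, 1,  1, 2, 3,  3, 2, 1,  1, 2, 1)
)

# Step 1: one median per locid, day, hour and team
step1 <- df_small %>%
    group_by(locid, day, hour, team) %>%
    summarise(metric_med = median(metric), .groups = "drop")

print(step1)
```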

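2 answers:

Answer 0 (score: 0)

One way to do this is to spread each locid, day and hour group into a single row and then compare the team columns. This solution works for more than two groups and for more complex conditions.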

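    library(dplyr)
    library(tidyr)

    df %>%
      # median per locid, day, hour and team
      group_by(locid, day, hour, team) %>%
      summarize(median = median(metric)) %>%
      # one column per team, one row per locid, day, hour
      spread(team, median) %>%
      # for three teams, extend the condition, e.g. tm1 == tm2, tm2 == tm3
      filter(tm1 == tm2)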

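Another possible solution is to arrange the summarised results by locid, day and hour, and then compare the median in each row with its lag. This solution only works when team has exactly two groups.

    df %>%
      group_by(locid, day, hour, team) %>%
      summarize(median = median(metric)) %>%
      arrange(locid, day, hour) %>%
      # still grouped by locid, day, hour, so lag() compares within each combo
      filter(median == lag(median))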
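
Both snippets above compare specific pairs of teams. A sketch that should work for any number of teams is to count the distinct medians within each locid, day, hour group and keep the groups that have exactly one. The names `df_small` and `same_median` are hypothetical; `df_small` holds the question's simpler example, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical name: df_small holds the 12 rows of the question's simpler example
df_small <- data.frame(
    locid  = rep(c(1111, 1122), each = 6),
    day    = 1,
    hour   = 1,
    team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
    metric = c(3, 2, 1,  1, 2, 3,  3, 2, 1,  1, 2, 1)
)

same_median <- df_small %>%
    # median per locid, day, hour and team
    group_by(locid, day, hour, team) %>%
    summarise(metric_med = median(metric), .groups = "drop") %>%
    # keep combos in which every team has the same median
    group_by(locid, day, hour) %>%
    filter(n_distinct(metric_med) == 1) %>%
    ungroup() %>%
    # collapse the per-team rows into one row per combo
    distinct(locid, day, hour, metric_med)

print(same_median)
```

Note that a combo observed for only one team also passes this filter. If every team must be present, one option is to add a row-count condition to the filter, for example `n() == 3` for three teams.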

Answer 1 (score: 0)

Let's reinterpret 'all equal' to mean "zero variance or a single observation". Thus:

    library(dplyr)

    df %>%
      # per locid, day, hour, team
      group_by(locid, day, hour, team) %>%
      # compute median
      summarize(team_median = median(metric)) %>%
      # ungroup before specifying new grouping
      ungroup %>%
      # for locid, day, hour
      group_by(locid, day, hour) %>%
      # find the medians that were the same for all teams
      # 'the same' here is taken to mean no variance
      # or having a single observation
      # note that, although logical vector TRUE | NA does yield TRUE
      # this is only because it must yield TRUE.
      # As another example, FALSE | NA, yields NA.
      # As a guard against team_medians that are NA, I add a coalesce wrapper.
      # I've decided that missing team_medians represent non-cases, YMMV
      summarize(all_equal = coalesce(n() == 1 | var(team_median) == 0, FALSE)) %>%
      filter(all_equal == TRUE) %>%
      select(-all_equal)
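
Since the question says a base R solution is fine too, here is a rough sketch of the same two steps using only `aggregate()` and `merge()`. It is not taken from either answer; the names `df_small`, `med`, `n_med`, `keep` and `result` are hypothetical, and `df_small` holds the question's simpler example.

```r
# Hypothetical name: df_small holds the 12 rows of the question's simpler example
df_small <- data.frame(
    locid  = rep(c(1111, 1122), each = 6),
    day    = 1,
    hour   = 1,
    team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
    metric = c(3, 2, 1,  1, 2, 3,  3, 2, 1,  1, 2, 1)
)

# Step 1: median per locid, day, hour and team
med <- aggregate(metric ~ locid + day + hour + team,
                 data = df_small, FUN = median)

# Step 2: number of distinct team medians per locid, day, hour
n_med <- aggregate(metric ~ locid + day + hour, data = med,
                   FUN = function(x) length(unique(x)))

# combos in which all teams share a single median
keep <- n_med[n_med$metric == 1, c("locid", "day", "hour")]

# attach the shared median and collapse the per-team duplicates
result <- unique(merge(keep, med[, c("locid", "day", "hour", "metric")]))

print(result)
```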