With the df below, I want to compute the median of metric for each of the teams tm1, tm2 and tm3 on a per locid, day, hour combo basis, then keep only those locid, day, hour combinations that have the same metric median across teams tm1, tm2 and tm3.
set.seed(100)
df <- data.frame(
locid = sample(c(1111,1122,1133), 20, replace=TRUE),
day = sample(c(1:3), 20, replace=TRUE),
hour = sample(c(1:4), 20, replace=TRUE),
team = sample(c("tm1", "tm2", "tm3"), 20, replace=TRUE),
metric = sample(1:5, 20, replace=TRUE )
)
My attempt:
library(dplyr)
df_medians <- df %>%
group_by(locid, day, hour, team) %>%
summarise(metric_median = median(metric))
This gives the median per team for each locid + day + hour. I now need to find the locid + day + hour combos that give the same median value across teams tm1, tm2 and tm3.
df_medians %>% group_by(locid, day, hour, team) %>% summarise(??what here??)
I was trying with dplyr, but a base R solution is fine too.
As a simpler example, we can look at the data below, which has measurements from two different locations for two teams.
+-------+------+-------+-------+---------+
| locid | day | hour | team | metric |
+-------+------+-------+-------+---------+
| 1111 | 1 | 1 | tm1 | 3 |
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm1 | 1 |
| 1111 | 1 | 1 | tm2 | 1 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1111 | 1 | 1 | tm2 | 3 |
| 1122 | 1 | 1 | tm1 | 3 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm1 | 1 |
| 1122 | 1 | 1 | tm2 | 1 |
| 1122 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+---------+
Step 1 - compute the median by group (locid + day + hour + team)
+-------+------+-------+-------+-------------+
| locid | day | hour | team | metric_med |
+-------+------+-------+-------+-------------+
| 1111 | 1 | 1 | tm1 | 2 |
| 1111 | 1 | 1 | tm2 | 2 |
| 1122 | 1 | 1 | tm1 | 2 |
| 1122 | 1 | 1 | tm2 | 1 |
+-------+------+-------+-------+-------------+
Step 2 - compare the medians across teams within each group (locid + day + hour). Only (1111, 1, 1) has the same metric_med across the teams tm1 and tm2.
+-------+------+-------+-------------+
| locid | day | hour | metric_med |
+-------+------+-------+-------------+
| 1111 | 1 | 1 | 2 |
+-------+------+-------+-------------+
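For reproducibility, the simpler example could be typed in as a data frame along these lines. This is only a sketch: the name data is an assumption (picked because the answers below refer to the data by that name), and step1 is a hypothetical name for the Step 1 table.

```r
library(dplyr)

# The simpler example from the table above, entered by hand.
# `data` is an assumed name; it shadows base R's data() function,
# which is harmless in a short script.
data <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
  metric = c(3, 2, 1, 1, 2, 3, 3, 2, 1, 1, 2, 1)
)

# Step 1: median per locid, day, hour and team.
step1 <- data %>%
  group_by(locid, day, hour, team) %>%
  summarise(metric_med = median(metric), .groups = "drop")

print(step1)
```

The .groups argument needs dplyr 1.0.0 or later; on older versions it can be replaced by a trailing ungroup().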
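Answer 0 (score: 0)
One approach is to spread the teams of each locid, day and hour group into a single row and then compare the columns. This solution also works for more than two groups and for more complex conditions.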
library(dplyr)
library(tidyr)
data %>%
group_by(locid, day, hour, team) %>%
summarize(median = median(metric)) %>%
spread(team, median) %>%
filter(tm1 == tm2)
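As a rough sketch of how the same idea might extend to the three teams in the question, the filter can compare all team columns. The data frame data3 and its values are made up for illustration, and pivot_wider() is used here as the newer tidyr equivalent of spread(); neither comes from the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical three-team data: at locid 1111 every team has median 2,
# at locid 1122 team tm2 differs from tm1 and tm3.
data3 <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(rep(c("tm1", "tm2", "tm3"), each = 2), times = 2),
  metric = c(2, 2, 1, 3, 2, 2,
             1, 1, 2, 2, 1, 1)
)

same_median <- data3 %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric), .groups = "drop") %>%
  # one row per locid/day/hour, one column per team
  pivot_wider(names_from = team, values_from = median) %>%
  # keep rows where all three team medians agree;
  # rows with a missing team (NA) are dropped by filter()
  filter(tm1 == tm2, tm2 == tm3)

print(same_median)
```

With many teams, listing every pairwise comparison gets tedious, so counting distinct medians per group may scale better than the wide layout.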
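Another possible solution is to arrange the summarized results by locid, day and hour, and then compare the median in each row with its lag. This solution only works when team has exactly two groups.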
data %>%
group_by(locid, day, hour, team) %>%
summarize(median = median(metric)) %>%
arrange(locid, day, hour) %>%
filter(median == lag(median))
Answer 1 (score: 0)
Let's reinterpret 'all the same' to mean "zero variance or a single observation". Thus:
df %>%
# per locid, day, hour, team
group_by(locid, day, hour, team) %>%
# compute median
summarize(team_median = median(metric)) %>%
# ungroup before specifying new grouping
ungroup %>%
# for locid, day, hour
group_by(locid, day, hour) %>%
# find the medians that were the same for all teams
# 'the same' here is taken to mean no variance
# or having a single observation
# note that, although logical vector TRUE | NA does yield TRUE
# this is only because it must yield TRUE.
# As another example, FALSE | NA, yields NA.
# As a guard against team_medians that are NA, I add a coalesce wrapper.
# I've decided that missing team_medians represent non-cases, YMMV
summarize(all_equal = coalesce(n() == 1 | var(team_median) == 0, FALSE)) %>%
filter(all_equal == TRUE) %>%
select(-all_equal)
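The pipeline above returns only the locid, day and hour columns. A possible variant that also keeps the shared median, as in the expected output in the question, is sketched below. It swaps the variance test for n_distinct(), which is a substitution and not part of the original answer; the data frame example is the simpler example from the question typed in by hand under an assumed name.

```r
library(dplyr)

# The simpler two-team example from the question (assumed name: example).
example <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
  metric = c(3, 2, 1, 1, 2, 3, 3, 2, 1, 1, 2, 1)
)

same_median <- example %>%
  # median per locid, day, hour and team
  group_by(locid, day, hour, team) %>%
  summarize(team_median = median(metric), .groups = "drop") %>%
  # back to one group per locid, day and hour
  group_by(locid, day, hour) %>%
  summarize(
    n_medians  = n_distinct(team_median),  # number of different team medians
    metric_med = first(team_median),       # the shared value when n_medians == 1
    .groups = "drop"
  ) %>%
  filter(n_medians == 1) %>%
  select(-n_medians)

print(same_median)
```

As with the variance version, a combination observed for only one team counts as 'the same' here, and medians are compared exactly, so fractional medians that differ by rounding error would be treated as different.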