Computing differences between groups: alternative approaches for multiple calculations

Time: 2019-01-09 21:18:21

Tags: r dplyr tidyr

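I often need to compute differences between groups, nested within some interval and/or other grouping. For a single variable, this is easy to do with spread followed by mutate. Here is a reproducible example using the ChickWeight dataset; don't get distracted by the calculation itself (it is only a toy example). My question is about how to handle a dataset like the data frame ChickSum created below.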

# reproducible dataset
library(dplyr)  # filter, group_by, summarize, mutate
library(tidyr)  # spread, gather
data(ChickWeight)
ChickSum = ChickWeight %>% 
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>% 
  summarize(mean.weight = mean(weight)) %>%
  ungroup()

Here is how I compute the change in mean chick weight between the first and last time points (stratified by diet):

# Compute change in mean weight between first and last time
ChickSum %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)

However, this does not work well for multiple variables:

ChickSum2 = ChickWeight %>% 
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>% 
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()

I can't spread by Time with both count and mean.weight as values; my current solution is to run the spread-mutate operation twice, once for count and once for mean.weight, and then join the results.

ChickCountChange = ChickSum2 %>%
  select(-mean.weight) %>%
  spread(Time, count) %>%
  mutate(count.change = `21` - `0`)
ChickWeightChange = ChickSum2 %>%
  select(-count) %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)

full_join(
  select(ChickWeightChange, Diet, weight.change), 
  select(ChickCountChange, Diet, count.change), 
  by = "Diet")

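Is there another approach to this kind of calculation? I've been trying to come up with a strategy that combines group_by and purrr::pmap to avoid {{1}} while keeping the advantages of the approach above (e.g. spread's {{1}} argument for choosing how to handle missing group combinations), but I haven't figured it out yet. I'm open to suggestions, or to alternative data structures and ways of thinking about the problem.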

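2 answers:

Answer 0 (score: 1)

You could try regrouping and then using lag() to compute the differences. It works for your toy example, but it would be better to check it against some real datasets: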

ChickWeight %>% 
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>% 
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup() %>% 
  group_by(Diet) %>% 
  mutate(count.change = count - lag(count), 
         weight.change = mean.weight - lag(mean.weight)) %>% 
  filter(Time == max(Time))

Result:

  Diet   Time count mean.weight count.change weight.change
  <fct> <dbl> <int>       <dbl>        <int>         <dbl>
1 1        21    16        178.           -4          136.
2 2        21    10        215.            0          174 
3 3        21    10        270.            0          230.
4 4        21     9        239.           -1          198.
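
The lag() approach above depends on the rows within each Diet being ordered by Time. As a minimal sketch of a variant that makes this ordering explicit, dplyr's first() and last() accept an order_by argument, so the difference can be computed in a second summarize step. The name chick_change is only illustrative and does not appear in the original thread.

```r
library(dplyr)

data(ChickWeight)

chick_change <- ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup() %>%
  group_by(Diet) %>%
  # last - first within each Diet, ordered explicitly by Time
  summarize(
    count.change  = last(count, order_by = Time) - first(count, order_by = Time),
    weight.change = last(mean.weight, order_by = Time) -
      first(mean.weight, order_by = Time)
  )

chick_change
```

This returns one row per Diet without the final filter step, at the cost of repeating the last/first expression for each variable.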

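Answer 1 (score: 0)

So, in the course of writing the reproducible example, I came up with a potential/partial solution. Essentially, we use gather to group by the variables themselves: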

ChickSum2 %>% 
  gather(variable, value, count, mean.weight) %>% 
  spread(Time, value) %>% mutate(Change = `21` - `0`) %>% 
  select(Diet, variable, Change) %>% 
  spread(variable, Change)

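This only works when both of the following conditions hold:

  1. All variables are of the same type (e.g. mean.weight and count are both numeric).
  2. The difference calculation is the same for all variables (e.g. I want to compute last - first for every variable).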
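
Both conditions come from stacking count and mean.weight into a single value column. As a hedged aside, and a different technique from the gather/spread one above: assuming tidyr 1.0.0 or later is available (the thread does not say which version is in use), pivot_wider() can spread several value columns at once, so nothing has to be stacked. The sketch below rebuilds ChickSum2 so that it runs on its own; chick_wide is an illustrative name.

```r
library(dplyr)
library(tidyr)  # pivot_wider() assumes tidyr >= 1.0.0

data(ChickWeight)

ChickSum2 <- ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()

chick_wide <- ChickSum2 %>%
  # creates count_0, count_21, mean.weight_0, mean.weight_21
  pivot_wider(names_from = Time, values_from = c(count, mean.weight)) %>%
  mutate(count.change  = count_21 - count_0,
         weight.change = mean.weight_21 - mean.weight_0) %>%
  select(Diet, count.change, weight.change)

chick_wide
```

Because each variable keeps its own columns, each difference can use its own formula and type. pivot_wider() also takes a values_fill argument for group combinations that are missing from the data.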

I suspect the second condition could be handled by using case_when
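
One way the case_when idea could look, as a minimal sketch only: the choice of a ratio for mean.weight is a made-up illustration of "a different calculation per variable" and is not something the thread asks for, and chick_mixed is a hypothetical name.

```r
library(dplyr)
library(tidyr)

data(ChickWeight)

ChickSum2 <- ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()

chick_mixed <- ChickSum2 %>%
  gather(variable, value, count, mean.weight) %>%
  spread(Time, value) %>%
  # pick the calculation per variable (the ratio is an illustrative assumption)
  mutate(Change = case_when(
    variable == "count"       ~ `21` - `0`,
    variable == "mean.weight" ~ `21` / `0`
  )) %>%
  select(Diet, variable, Change) %>%
  spread(variable, Change)

chick_mixed
```

Every branch of case_when still has to return the same type, so this relaxes the second condition but not the first.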