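I often need to compute differences between groups, nested within some interval and/or other groupings. For a single variable, this is easy to do with spread and mutate. Here is a reproducible example using the ChickWeight dataset. Don't get distracted by the calculation itself (it is only a toy example); my question is how to handle datasets like the ChickSum data frame created below.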
# reproducible dataset
library(dplyr)
library(tidyr)
data(ChickWeight)
ChickSum = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(mean.weight = mean(weight)) %>%
  ungroup()
Here is how I compute the change in mean chick weight between the first and last time points (stratified by diet):
# Compute change in mean weight between first and last time
ChickSum %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)
However, this does not work well for multiple variables:
ChickSum2 = ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()
I can't spread by Time with both count and mean.weight as value columns. My current solution is to run the spread-mutate operation twice, once for count and once for mean.weight, and then join the results.
ChickCountChange = ChickSum2 %>%
  select(-mean.weight) %>%
  spread(Time, count) %>%
  mutate(count.change = `21` - `0`)
ChickWeightChange = ChickSum2 %>%
  select(-count) %>%
  spread(Time, mean.weight) %>%
  mutate(weight.change = `21` - `0`)
full_join(
  select(ChickWeightChange, Diet, weight.change),
  select(ChickCountChange, Diet, count.change),
  by = "Diet")
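Is there another approach for this kind of calculation? I have been trying to devise a strategy that combines group_by and purrr::pmap to avoid {{1 }} while keeping the advantages of the approach above (e.g. spread's {{1}} argument for choosing how to handle missing group combinations), but I haven't figured it out. I'm open to suggestions or to alternative data structures and ways of thinking about the problem.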
Answer 0 (score: 1)
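You could try regrouping and then using lag() to compute the differences. This works for your toy example, but it would be better to check it against some real datasets: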
ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  # now also compute variable "count"
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup() %>%
  group_by(Diet) %>%
  mutate(count.change = count - lag(count),
         weight.change = mean.weight - lag(mean.weight)) %>%
  filter(Time == max(Time))
Result:
  Diet   Time count mean.weight count.change weight.change
  <fct> <dbl> <int>       <dbl>        <int>         <dbl>
1 1        21    16        178.           -4          136.
2 2        21    10        215.            0          174
3 3        21    10        270.            0          230.
4 4        21     9        239.           -1          198.
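Note that lag() depends on the rows within each Diet being ordered by Time. Here is a minimal sketch of a variant that makes that ordering explicit, assuming dplyr 1.0.0 or later (for the .groups argument) and that only the first and last time points matter. It swaps lag() for arrange() plus first() and last(), so it is a substitute technique rather than the code shown above, and chick_change is just an illustrative name:

```r
library(dplyr)

data(ChickWeight)

chick_change <- ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(count = n(), mean.weight = mean(weight), .groups = "drop") %>%
  # sort explicitly so first()/last() do not depend on incoming row order
  arrange(Diet, Time) %>%
  group_by(Diet) %>%
  # one row per Diet: last time point minus first time point
  summarize(count.change = last(count) - first(count),
            weight.change = last(mean.weight) - first(mean.weight))
```

This returns one row per Diet with only the change columns, so there is no need for the trailing filter(Time == max(Time)).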
Answer 1 (score: 0)
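So, in the process of writing the reproducible example, I came up with a potential/partial solution. Essentially, we use gather so that the variables themselves become a grouping: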
ChickSum2 %>%
  gather(variable, value, count, mean.weight) %>%
  spread(Time, value) %>%
  mutate(Change = `21` - `0`) %>%
  select(Diet, variable, Change) %>%
  spread(variable, Change)
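This only works when the following two conditions hold: (1) the variables are all of the same type (here, mean.weight and count are both numeric), and (2) the change calculation is the same for every variable (here, last - first). I suspect the second condition could be relaxed by using case_when.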
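The case_when idea might look something like the sketch below. It rebuilds ChickSum2 as defined in the question so that it runs on its own. The rule it applies (absolute change for count, relative change for mean.weight) is purely hypothetical and chosen only to show two different calculations side by side; ChickChange is an illustrative name.

```r
library(dplyr)
library(tidyr)

data(ChickWeight)

# Rebuild ChickSum2 as defined in the question
ChickSum2 <- ChickWeight %>%
  filter(Time == max(Time) | Time == min(Time)) %>%
  group_by(Diet, Time) %>%
  summarize(count = n(), mean.weight = mean(weight)) %>%
  ungroup()

ChickChange <- ChickSum2 %>%
  gather(variable, value, count, mean.weight) %>%
  spread(Time, value) %>%
  # hypothetical rule: absolute change for count, relative change for mean.weight
  mutate(Change = case_when(
    variable == "count"       ~ `21` - `0`,
    variable == "mean.weight" ~ (`21` - `0`) / `0`
  )) %>%
  select(Diet, variable, Change) %>%
  spread(variable, Change)
```

The first condition still applies: gather stacks count and mean.weight into a single value column, so both have to be coercible to a common type.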