How do you use spread() when your data has multiple "key" variables?

Asked: 2018-01-17 23:49:32

Tags: r dplyr tidyverse

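Edit: apologies for the more-than-minimal example. I worked through the problem with a more minimal example, and it looks like aosmith's answer has solved it!

This is the next step after this question, in the same process. It's been a doozy.

I have a dataset containing a series of variables, each of which has low, mid, and high values. There are also multiple identifying variables, which I'm calling "scenario" and "month" here just for this example. I'm doing a calculation involving 3 different values, some of which have low, mid, or high values that differ in each scenario and each month.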

# generating a practice dataset

library(dplyr)
library(tidyr)
set.seed(123)

pracdf <- bind_cols(expand.grid(ID = letters[1:2], 
                                month = 1:2, 
                                scenario = c("a", "b")),
                    # tibble() replaces the deprecated data_frame()
                    tibble(p.mid = runif(8, 100, 1000),
                           a = rep(runif(2), 4),
                           b = rep(runif(2), 4),
                           c = rep(runif(2), 4)))

pracdf <- pracdf %>% mutate(p.low = p.mid * 0.75,
                            p.high = p.mid * 1.25) %>%
  gather(p.low, p.mid, p.high, key = "ptype", value = "p") 

# all of that is just to generate the practice dataset.
# 2 IDs * 2 months * 2 scenarios * 3 different values of p = 24 total rows in this dataset

# Do the calculation

pracdf2 <- pracdf %>%
  mutate(result = p * a * b * c)

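This fully "gathered" dataset has the results I want. Now let's do a spread-type operation to present it in a more readable way, where each combination of month, scenario, and p type gets its own column. An example column name would be "month1_scenario.a_p.low". The resulting dataset would have 2 months * 3 p types * 2 scenarios = 12 columns in total.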

# this fully "gathered" dataset is exactly what I want. 
# Let's put it in a format that the supervisor for this project will be happy with
# ID, month, scenario, and p.type are all "key" variables
# spread() only allows one key variable at a time, so...

pracdf2.spread1 <- pracdf2 %>% spread(ptype, result, sep = ".")
# Produces NA's. Looks like it's messing up with the different values of p

pracdf2.spread2 <-  pracdf2 %>% select(-p) %>% spread(ptype, result, sep = ".")
# that's better, now let's spread across scenarios

pracdf2.spread2.spread2low <- pracdf2.spread2 %>% select(-ptype.p.high, -ptype.p.mid) %>% spread(scenario, ptype.p.low, sep = ".")
pracdf2.spread2.spread2mid <- pracdf2.spread2 %>% select(-ptype.p.low, -ptype.p.high) %>% spread(scenario, ptype.p.mid, sep = ".")
pracdf2.spread2.spread2high <- pracdf2.spread2 %>% select(-ptype.p.mid, -ptype.p.low) %>% spread(scenario, ptype.p.high, sep = ".")

pracdf2.spread2.spread2 <- pracdf2.spread2.spread2low %>% left_join(pracdf2.spread2.spread2mid)

# Ok, that was rough and will clearly spiral out of control quickly
# what am I still doing with my life?

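I could use spread() on each key column in turn and then redo the spread for each subsequent value column, but that would take a long time and would be error-prone.

Is there a cleaner, tidier, more elegant way to do this?

Thanks!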

1 Answer:

Answer 0 (score: 3)

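You can use unite from tidyr to combine the three columns into one before spreading.

Then you can spread using the new column as the key and "result" as the value.

Before spreading, I also removed columns "a" through "p", since those don't seem to be needed in the desired result.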

pracdf2 %>%
     unite("allgroups", month, scenario, ptype) %>%
     select(-(a:p)) %>%
     spread(allgroups, result)

# A tibble: 2 x 13
  ID    `1_a_p.high` `1_a_p.low` `1_a_p.mid` `1_b_p.high` `1_b_p.low` `1_b_p.mid` `2_a_p.high` `2_a_p.low`
  <fct>        <dbl>       <dbl>       <dbl>        <dbl>       <dbl>       <dbl>        <dbl>       <dbl>
1 a              160        96.2       128          423         254         338            209       126  
2 b              120        72.0        96.0         20.9        12.5        16.7          133        79.5
# ... with 4 more variables: `2_a_p.mid` <dbl>, `2_b_p.high` <dbl>, `2_b_p.low` <dbl>, `2_b_p.mid` <dbl>
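
For readers on a newer tidyr, where spread() has been superseded by pivot_wider(), the same reshape can probably be done without the unite() step. This is a sketch and not part of the original answer: it assumes tidyr >= 1.1.0 (needed for the names_glue argument), and the object name `wide` is only illustrative. The naming pattern is taken from the example column name in the question.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr >= 1.1.0 for pivot_wider(names_glue = )
set.seed(123)

# Rebuild the practice data from the question in one pipeline
pracdf2 <- bind_cols(expand.grid(ID = letters[1:2],
                                 month = 1:2,
                                 scenario = c("a", "b")),
                     tibble(p.mid = runif(8, 100, 1000),
                            a = rep(runif(2), 4),
                            b = rep(runif(2), 4),
                            c = rep(runif(2), 4))) %>%
  mutate(p.low = p.mid * 0.75,
         p.high = p.mid * 1.25) %>%
  gather(p.low, p.mid, p.high, key = "ptype", value = "p") %>%
  mutate(result = p * a * b * c)

# Spread `result` across all three key columns at once.
# id_cols = ID keeps one row per ID; the unused columns (a, b, c, p) are dropped.
wide <- pracdf2 %>%
  pivot_wider(id_cols = ID,
              names_from = c(month, scenario, ptype),
              values_from = result,
              names_glue = "month{month}_scenario.{scenario}_{ptype}")

dim(wide)    # 2 rows, 13 columns (ID plus the 12 combinations)
names(wide)
```

Here id_cols = ID plays the role of select(-(a:p)). Leaving a column such as p in as a row identifier is what produced the NAs in the first spread() attempt in the question, because p takes a different value for each ptype and so the rows cannot collapse into one per ID.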