How to gather multiple instances of an event with dplyr and create a tidy tibble

Time: 2017-03-17 17:03:38

Tags: r dplyr tidyr tidyverse

I have a dataset similar to this:

library(tidyverse)

df <- tibble(
  subjid = 1:5,
  event_1 = c("Watery eyes",         # Event number 1 
          "Sore throat",
          "Vomiting",
          "Gastroenteritis viral",
          "Dry Mouth"),
  start_date_1 = as.Date("2017-01-02") + 0:4,
  stop_date_1 = as.Date("2017-01-03") + 0:4,
  severity_1 = 1,
  related_to_drug_1 = 0,
  event_2 = c("Nausea",             # Event number 2
          "Dizziness",
          "Cough",
          "Disorientation",
          "Diarrhea"),
  start_date_2 = as.Date("2017-02-02") + 0:4,
  stop_date_2 = as.Date("2017-02-03") + 0:4,
  severity_2 = 2,
  related_to_drug_2 = 1,
  event_3 = c("Eczema",             # Event number 3
          "Sinusitis",
          "Abdominal discomfort",
          "Muscle spasms",
          "Nasopharyngitis"),
  start_date_3 = as.Date("2017-03-02") + 0:4,
  stop_date_3 = as.Date("2017-03-03") + 0:4,
  severity_3 = 2,
  related_to_drug_3 = 1
)
df

# A tibble: 5 × 16
  subjid               event_1 start_date_1 stop_date_1 severity_1 related_to_drug_1        event_2 start_date_2 stop_date_2 severity_2 related_to_drug_2              event_3
   <int>                 <chr>       <date>      <date>      <dbl>             <dbl>          <chr>       <date>      <date>      <dbl>             <dbl>                <chr>
1      1           Watery eyes   2017-01-02  2017-01-03          1                 0         Nausea   2017-02-02  2017-02-03          2                 1               Eczema
2      2           Sore throat   2017-01-03  2017-01-04          1                 0      Dizziness   2017-02-03  2017-02-04          2                 1            Sinusitis
3      3              Vomiting   2017-01-04  2017-01-05          1                 0          Cough   2017-02-04  2017-02-05          2                 1 Abdominal discomfort
4      4 Gastroenteritis viral   2017-01-05  2017-01-06          1                 0 Disorientation   2017-02-05  2017-02-06          2                 1        Muscle spasms
5      5             Dry Mouth   2017-01-06  2017-01-07          1                 0       Diarrhea   2017-02-06  2017-02-07          2                 1      Nasopharyngitis
# ... with 4 more variables: start_date_3 <date>, stop_date_3 <date>, severity_3 <dbl>, related_to_drug_3 <dbl>

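However, the real data has many more rows and more than 100 "event" column series. The data frame has one row per subject, holding that subject's adverse events and their attributes. Each attribute sits in a column whose name ends in an underscore and a number indicating which event it belongs to. I would like to use tidyr to gather these events into something like this: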

# A tibble: 15 × 7
   subjid event_number                 event start_date  stop_date severity related_to_drug
    <int>        <int>                 <chr>     <date>     <date>    <int>                <int>
1       1            1           Watery eyes 2017-01-02 2017-01-03        1                    0
2       2            1           Sore throat 2017-01-03 2017-01-04        1                    0
3       3            1              Vomiting 2017-01-04 2017-01-05        1                    0
4       4            1 Gastroenteritis viral 2017-01-05 2017-01-06        1                    0
5       5            1             Dry Mouth 2017-01-06 2017-01-07        1                    0
6       1            2                Nausea 2017-02-02 2017-02-03        2                    1
7       2            2             Dizziness 2017-02-03 2017-02-04        2                    1
8       3            2                 Cough 2017-02-04 2017-02-05        2                    1
9       4            2        Disorientation 2017-02-05 2017-02-06        2                    1
10      5            2              Diarrhea 2017-02-06 2017-02-07        2                    1
11      1            3                Eczema 2017-03-02 2017-03-03        2                    1
12      2            3             Sinusitis 2017-03-03 2017-03-04        2                    1
13      3            3  Abdominal discomfort 2017-03-04 2017-03-05        2                    1
14      4            3         Muscle spasms 2017-03-05 2017-03-06        2                    1
15      5            3       Nasopharyngitis 2017-03-06 2017-03-07        2                    1

That is, one row per adverse event, with columns identifying the attributes of that particular event.

2 answers:

Answer 0: (score: 1)

You can do this with the following code:

df %>%
  gather(Var,Val,-1) %>%
  mutate(Var = gsub('_(\\d+)','!!\\1',Var)) %>% 
  separate(Var,c('Var','Event'),sep = '!!') %>%
  spread(Var,Val)

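Unfortunately, this destroys the classes of the columns, because gather puts values of different types into a single column. The classes need to be restored afterwards, which you can do with a call to mutate.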

(Also note that the mutate line after gather is only there because your column names contain '_' and I wanted to split off the event number.)
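
The class repair mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: `df_small` and `fixed` are made-up names, the data is a cut-down version of the question's `df`, and `across()` requires dplyr 1.0.0 or later. Every value column is converted to character up front so that the strings gather produces are predictable, and the classes are restored in a final mutate.

```r
library(dplyr)
library(tidyr)

# Hypothetical cut-down version of the question's data: 2 subjects, 2 events
df_small <- tibble(
  subjid = 1:2,
  event_1 = c("Watery eyes", "Sore throat"),
  start_date_1 = as.Date("2017-01-02") + 0:1,
  severity_1 = 1,
  event_2 = c("Nausea", "Dizziness"),
  start_date_2 = as.Date("2017-02-02") + 0:1,
  severity_2 = 2
)

fixed <- df_small %>%
  # Convert everything except the id to character, so dates become "2017-01-02"
  mutate(across(-subjid, as.character)) %>%
  gather(Var, Val, -subjid) %>%
  # Mark the trailing event number so it can be split off the attribute name
  mutate(Var = gsub("_(\\d+)$", "!!\\1", Var)) %>%
  separate(Var, c("Var", "Event"), sep = "!!") %>%
  spread(Var, Val) %>%
  # Restore the classes that were flattened to character
  mutate(
    Event = as.integer(Event),
    start_date = as.Date(start_date),
    severity = as.numeric(severity)
  )
```

With 100+ event series the final mutate still only lists each attribute once, since the event number has already moved into its own column by that point.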

Answer 1: (score: 1)

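This is a more convoluted way, but, very importantly, it preserves the classes.
Start from the column names, split them by event number, create one data frame per event, and finally stack the data frames vertically: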

names(df) %>% 
  setdiff("subjid") %>% 
  split(sub(".*_(\\d+)$", "\\1", x = .)) %>% 
  map(~ select_(.data = df, .dots = c("subjid", .x))) %>% 
  map(~ setNames(.x, nm = sub("(.*)_\\d+$", "\\1", x = names(.x)))) %>%
  map2(names(.), ~ mutate(.x, event_number = .y)) %>% 
  bind_rows() %>% 
  select(subjid, event_number, everything())
# # A tibble: 15 × 7
#    subjid event_number                 event start_date  stop_date severity related_to_drug
#     <int>        <chr>                 <chr>     <date>     <date>    <dbl>           <dbl>
# 1       1            1           Watery eyes 2017-01-02 2017-01-03        1               0
# 2       2            1           Sore throat 2017-01-03 2017-01-04        1               0
# 3       3            1              Vomiting 2017-01-04 2017-01-05        1               0
# 4       4            1 Gastroenteritis viral 2017-01-05 2017-01-06        1               0
# 5       5            1             Dry Mouth 2017-01-06 2017-01-07        1               0
# 6       1            2                Nausea 2017-02-02 2017-02-03        2               1
# 7       2            2             Dizziness 2017-02-03 2017-02-04        2               1
# 8       3            2                 Cough 2017-02-04 2017-02-05        2               1
# 9       4            2        Disorientation 2017-02-05 2017-02-06        2               1
# 10      5            2              Diarrhea 2017-02-06 2017-02-07        2               1
# 11      1            3                Eczema 2017-03-02 2017-03-03        2               1
# 12      2            3             Sinusitis 2017-03-03 2017-03-04        2               1
# 13      3            3  Abdominal discomfort 2017-03-04 2017-03-05        2               1
# 14      4            3         Muscle spasms 2017-03-05 2017-03-06        2               1
# 15      5            3       Nasopharyngitis 2017-03-06 2017-03-07        2               1