使用dplyr

时间:2019-08-16 15:11:55

标签: r dplyr

我有一个数据框df,其中包含变量gfr和相应的gfr_date。我还有第二个数据框drug,它指示何时开始和停止药物。如ID 1所示,该药物可以多次启动和停止。

我想“合并”两个数据帧(wanted)以得到一个新变量(drug1),如果id在gfr_date处使用某种药物,则该变量为1,如果该变量使用了某种药物,则为0不是。

一些示例数据dfdrug

library(dplyr)
#> 
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- structure(list(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
                            2, 2, 3, 3, 3, 3), 
                     gfr = c(90, 109, 84, 81, 92, 76, 40, 64, 41, 
                             64, 28, 82, 61, 53, 47, 31, 29, 31, 31), 
                     gfr_date = structure(c(16812, 16996, 17178, 
                                            17372, 17542, 17731, 
                                            16686, 16741, 16898, 
                                            16909, 17098, 17120, 
                                            17295, 17442, 17668, 
                                            17283, 17435, 17568, 
                                            17758), class = "Date")), 
                row.names = c(NA, -19L), class = c("tbl_df", "tbl", "data.frame"))

df
#> # A tibble: 19 x 3
#>       id   gfr gfr_date  
#>    <dbl> <dbl> <date>    
#>  1     1    90 2016-01-12
#>  2     1   109 2016-07-14
#>  3     1    84 2017-01-12
#>  4     1    81 2017-07-25
#>  5     1    92 2018-01-11
#>  6     1    76 2018-07-19
#>  7     2    40 2015-09-08
#>  8     2    64 2015-11-02
#>  9     2    41 2016-04-07
#> 10     2    64 2016-04-18
#> 11     2    28 2016-10-24
#> 12     2    82 2016-11-15
#> 13     2    61 2017-05-09
#> 14     2    53 2017-10-03
#> 15     2    47 2018-05-17
#> 16     3    31 2017-04-27
#> 17     3    29 2017-09-26
#> 18     3    31 2018-02-06
#> 19     3    31 2018-08-15

drug <- structure(list(id = c(1, 1, 2, 3), 
               drug = c("drug1", "drug1", "drug1", "drug1"), 
               drstartd = structure(c(16261, 17008, 17443, 16252), class = "Date"), 
               drstopd = structure(c(16417, NA, NA, 17412), class = "Date")), 
          class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))

drug
#> # A tibble: 4 x 4
#>      id drug  drstartd   drstopd   
#>   <dbl> <chr> <date>     <date>    
#> 1     1 drug1 2014-07-10 2014-12-13
#> 2     1 drug1 2016-07-26 NA        
#> 3     2 drug1 2017-10-04 NA        
#> 4     3 drug1 2014-07-01 2017-09-03

wanted <- df %>% 
  mutate(drug1 = c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0))

wanted
#> # A tibble: 19 x 4
#>       id   gfr gfr_date   drug1
#>    <dbl> <dbl> <date>     <dbl>
#>  1     1    90 2016-01-12     0
#>  2     1   109 2016-07-14     0
#>  3     1    84 2017-01-12     1
#>  4     1    81 2017-07-25     1
#>  5     1    92 2018-01-11     1
#>  6     1    76 2018-07-19     1
#>  7     2    40 2015-09-08     0
#>  8     2    64 2015-11-02     0
#>  9     2    41 2016-04-07     0
#> 10     2    64 2016-04-18     0
#> 11     2    28 2016-10-24     0
#> 12     2    82 2016-11-15     0
#> 13     2    61 2017-05-09     0
#> 14     2    53 2017-10-03     0
#> 15     2    47 2018-05-17     1
#> 16     3    31 2017-04-27     1
#> 17     3    29 2017-09-26     0
#> 18     3    31 2018-02-06     0
#> 19     3    31 2018-08-15     0

reprex package(v0.3.0)于2019-08-19创建

我希望所需的数据帧正确无误,因为我手工完成了...谢谢您的帮助!

编辑 从reprex中删除了STATA属性。

1 个答案:

答案 0 :(得分:1)

我相信这应该可以满足您的要求。请注意,我将'drug'的id列设置为常规数字(我摆脱了'label =“ ID”,format.stata =“%12.0g”'),以便可以对'id'进行连接

df %>%
  left_join(drug, by = 'id') %>%
  mutate(drug1 = gfr_date >= drstartd & (gfr_date <= drstopd | is.na(drstopd))) %>%
  group_by(id, gfr, gfr_date) %>%
  summarize(drug1 = as.numeric(any(drug1))) %>%
  arrange(id, gfr_date)