Combining multiple summary statistics in dplyr analysis

Time: 2018-08-22 13:48:11

Tags: r dplyr plyr

For a sample dataframe:

df1 <- structure(list(practice = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), drug = c("123A456", 
"123A567", "123A123", "123A567", "123A456", "123A123", "123A567", 
"123A567", "998A125", "123A456", "998A125", "123A567", "123A456", 
"998A125", "123A567", "123A567", "123A567", "998A125", "123A123", 
"998A125", "123A123", "123A456", "998A125", "123A567", "998A125", 
"123A456", "123A123", "998A125", "123A567", "123A567", "998A125", 
"123A456", "123A123", "123A567", "123A567", "998A125", "123A456"
), items = c(1, 2, 3, 4, 5, 4, 6, 7, 8, 9, 5, 6, 7, 8, 9, 4, 
5, 6, 3, 2, 3, 4, 5, 6, 7, 4, 3, 2, 3, 4, 5, 4, 3, 4, 5, 6, 4
), quantity = c(1, 2, 4, 5, 3, 2, 3, 5, 4, 5, 7, 9, 5, 3, 4, 
6, 1, 2, 4, 5, 3, 2, 3, 5, 4, 5, 7, 9, 5, 3, 4, 6, 1, 2, 4, 5, 
3)), .Names = c("practice", "drug", "items", "quantity"), row.names = c(NA, 
-37L), spec = structure(list(cols = structure(list(practice = structure(list(), class = c("collector_integer", 
"collector")), drug = structure(list(), class = c("collector_character", 
"collector")), items = structure(list(), class = c("collector_integer", 
"collector")), quantity = structure(list(), class = c("collector_integer", 
"collector"))), .Names = c("practice", "drug", "items", "quantity"
)), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df", 
"tbl", "data.frame"))

I want to run several analyses. I think dplyr is the right tool, but I am struggling to put the pieces together.

My dataframe is a list of drugs, and I want to summarise a subset of them (defined by the first three digits of the drug code).

  1. I want to report the sums for that type of drug (codes starting with 123), drug123.items and drug123.quantity, BY practice.

  2. I also want to report the totals across all the drugs in my dataframe, all.items and all.quantity (I'll eventually express drug123 as a percentage of all the drugs).

I can do parts of the analysis individually. For example, this summarises the total items by practice:

library(dplyr)

practice <- df1 %>% 
  group_by(practice) %>% 
  summarise(all.items = sum(items))

... and this keeps only the drugs I am interested in:

# compare against the string "123"; substr() returns character
drug123 <- df1 %>% 
  filter(substr(drug, 1, 3) == "123")

ALL.drug123 <- aggregate(drug123$quantity, by = list(Category = drug123$practice), FUN = sum)

But how do I put everything together?

I want a dataframe with the following columns:

practice (1,2,3 in the dataframe given).

drug123.items #for drug123

drug123.quantity #for drug123

all.items #for all drugs

all.quantity #for all drugs

Any ideas?

1 answer:

Answer 0 (score: 1)

I think this is what you want:

df1 %>%
  group_by(practice) %>%
  summarize(items_123 = sum(if_else(stringr::str_detect(drug, '^123'), items, 0)),
            quantity_123 = sum(if_else(stringr::str_detect(drug, '^123'), quantity, 0)),
            all_items = sum(items),
            all_quantity = sum(quantity))

# A tibble: 3 x 5
  practice items_123 quantity_123 all_items all_quantity
     <int>     <dbl>        <dbl>     <dbl>        <dbl>
1        1        54           44        75           58
2        2        44           42        66           65
3        3        24           19        35           28
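
The question mentions eventually expressing the 123 drugs as a percentage of all drugs. Here is a minimal sketch of that step, assuming the summary columns are named as in the code above. The `toy` tibble and the `pct_*` column names are made up for illustration, and base `startsWith()` with subsetting is swapped in for `stringr::str_detect()` with `if_else()`:

```r
library(dplyr)

# Toy data with the same columns as df1 (values are made up for illustration)
toy <- tibble(
  practice = c(1L, 1L, 1L, 2L, 2L),
  drug     = c("123A456", "998A125", "123A567", "123A123", "998A125"),
  items    = c(1, 8, 2, 3, 6),
  quantity = c(1, 4, 2, 4, 2)
)

result <- toy %>%
  group_by(practice) %>%
  summarise(
    # sum only the rows whose drug code starts with "123"
    items_123    = sum(items[startsWith(drug, "123")]),
    quantity_123 = sum(quantity[startsWith(drug, "123")]),
    # totals over every drug in the practice
    all_items    = sum(items),
    all_quantity = sum(quantity)
  ) %>%
  # hypothetical percentage columns: share of the 123 drugs in each total
  mutate(
    pct_items_123    = 100 * items_123 / all_items,
    pct_quantity_123 = 100 * quantity_123 / all_quantity
  )

print(result)
```

Subsetting before `sum()` avoids having to supply a typed zero as the fallback value; a practice with no matching drugs should still get 0, because `sum()` of an empty numeric vector is 0.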