Reduce data to booleans while keeping unique week numbers in dplyr

Time: 2017-10-27 21:11:39

Tags: r dplyr

Given the following data

year    date        wk  name       type    holiday closed_day
2017    2017-11-27  48  NA          NA      0         0
2017    2017-12-04  49  NA          NA      0         0
2017    2017-12-11  50  NA          NA      0         0
2017    2017-12-18  51  NA          NA      0         0
2017    2017-12-25  52  Christmas   closed  0         1
2017    2017-12-26  52  NA          NA      0         0
2017    2017-12-31  52  NewYearsEve holiday 1         0

how can I use dplyr to get

year    date        wk  holiday closed_day
2017    2017-11-27  48    0       0
2017    2017-12-04  49    0       0
2017    2017-12-11  50    0       0
2017    2017-12-18  51    0       0
2017    2017-12-25  52    1       1

Note that I don't need name or type. For each week I only need to know whether that week contains a holiday or a closed_day (a boolean, not a sum).
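For anyone who wants to run the answers below, here is a minimal sketch of the input as a data frame. The column types are an assumption (integer year, wk and flags, Date for date, character for name and type); the question only shows the printed values.

```r
# Reconstruction of the question's table; column types are assumed.
df <- data.frame(
  year = 2017L,
  date = as.Date(c("2017-11-27", "2017-12-04", "2017-12-11", "2017-12-18",
                   "2017-12-25", "2017-12-26", "2017-12-31")),
  wk = c(48L, 49L, 50L, 51L, 52L, 52L, 52L),
  name = c(NA, NA, NA, NA, "Christmas", NA, "NewYearsEve"),
  type = c(NA, NA, NA, NA, "closed", NA, "holiday"),
  holiday = c(0L, 0L, 0L, 0L, 0L, 0L, 1L),
  closed_day = c(0L, 0L, 0L, 0L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)
```

Week 52 is the only week with more than one row, so it is the only one where the reduction actually has to combine values.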

3 answers:

Answer 0: (score: 2)

Try this:

library(dplyr)

df %>% 
  group_by(wk) %>% 
  mutate(holiday = max(holiday) > 0,
         closed_day = max(closed_day) > 0) %>% 
  distinct(wk, .keep_all = TRUE) %>% 
  select(year, date, wk, holiday, closed_day)

Which gives:

# A tibble: 5 x 5
# Groups:   wk [5]
   year       date    wk holiday closed_day
  <int>     <date> <int>   <lgl>      <lgl>
1  2017 2017-11-27    48   FALSE      FALSE
2  2017 2017-12-04    49   FALSE      FALSE
3  2017 2017-12-11    50   FALSE      FALSE
4  2017 2017-12-18    51   FALSE      FALSE
5  2017 2017-12-25    52    TRUE       TRUE
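
  1. Group by wk.
  2. Turn holiday and closed_day into logicals by asking whether the maximum of each within the week is greater than 0.
  3. Keep one row per distinct wk (the first row of each week).
  4. Select the variables you want.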
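The four steps above can be sketched end to end on a small reconstruction of the question's data. The data frame below is an assumption based on the printed table, and the ungroup() call is an addition of mine, not part of the answer; it only drops the leftover grouping so later operations are not applied per week by accident.

```r
library(dplyr)

# Assumed reconstruction of the question's data (name and type omitted).
df <- data.frame(
  year = 2017L,
  date = as.Date(c("2017-11-27", "2017-12-04", "2017-12-11", "2017-12-18",
                   "2017-12-25", "2017-12-26", "2017-12-31")),
  wk = c(48L, 49L, 50L, 51L, 52L, 52L, 52L),
  holiday = c(0L, 0L, 0L, 0L, 0L, 0L, 1L),
  closed_day = c(0L, 0L, 0L, 0L, 1L, 0L, 0L)
)

result <- df %>%
  group_by(wk) %>%                           # 1. group by week
  mutate(holiday = max(holiday) > 0,         # 2. per-week flag as logical
         closed_day = max(closed_day) > 0) %>%
  distinct(wk, .keep_all = TRUE) %>%         # 3. first row of each week
  ungroup() %>%                              # added: drop the grouping
  select(year, date, wk, holiday, closed_day) # 4. keep wanted columns
```

Because mutate() writes the weekly flag onto every row before distinct() runs, it does not matter that distinct() keeps only the first row of week 52.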

Answer 1: (score: 2)

If you are not particular about which year and date values you get for each week, then you can use:

library(dplyr)
df %>%
  group_by(wk) %>%
  summarize_at(vars(year, date, holiday, closed_day), funs(max(.)))
# # A tibble: 5 × 5
#      wk  year       date holiday closed_day
#   <int> <int>     <date>   <int>      <int>
# 1    48  2017 2017-11-27       0          0
# 2    49  2017 2017-12-04       0          0
# 3    50  2017 2017-12-11       0          0
# 4    51  2017 2017-12-18       0          0
# 5    52  2017 2017-12-31       1          1

Otherwise:

df %>%
  group_by(wk) %>%
  summarize(year = year[1], date = date[1],
            holiday = 1*any(holiday > 0),
            closed_day = 1*any(closed_day > 0))
# # A tibble: 5 × 5
#      wk  year       date holiday closed_day
#   <int> <int>     <date>   <dbl>      <dbl>
# 1    48  2017 2017-11-27       0          0
# 2    49  2017 2017-12-04       0          0
# 3    50  2017 2017-12-11       0          0
# 4    51  2017 2017-12-18       0          0
# 5    52  2017 2017-12-25       1          1

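(I took a slightly different approach to holiday and closed_day the second time, in case you have weeks where the flag adds up to "two" and only need the > 0 logic. In that case, keeping the columns logical instead of numeric would be the cleaner choice for both code and data.)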
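As a sketch of the "keep it logical" variant mentioned above, the two flags can be summarised together with across(). This assumes dplyr 1.0.0 or later, where across() is available; the data frame is again an assumed reconstruction of the question's table.

```r
library(dplyr)

# Assumed reconstruction of the question's data (name and type omitted).
df <- data.frame(
  year = 2017L,
  date = as.Date(c("2017-11-27", "2017-12-04", "2017-12-11", "2017-12-18",
                   "2017-12-25", "2017-12-26", "2017-12-31")),
  wk = c(48L, 49L, 50L, 51L, 52L, 52L, 52L),
  holiday = c(0L, 0L, 0L, 0L, 0L, 0L, 1L),
  closed_day = c(0L, 0L, 0L, 0L, 1L, 0L, 0L)
)

result <- df %>%
  group_by(wk) %>%
  summarise(
    year = first(year),                            # first row's year per week
    date = first(date),                            # first row's date per week
    across(c(holiday, closed_day), ~ any(.x > 0))  # logical flag per week
  )
```

Using any(.x > 0) rather than max() keeps the intent explicit and yields logical columns directly, with no multiplication by 1 needed.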

Answer 2: (score: 2)

If you are interested in a data.table approach, we can do:

library(data.table)
setDT(df)[, .(date = date[1], holiday = any(holiday), closed = any(closed_day)), 
          by = .(year, wk)]

#    year wk       date holiday closed
# 1: 2017 48 2017-11-27   FALSE  FALSE
# 2: 2017 49 2017-12-04   FALSE  FALSE
# 3: 2017 50 2017-12-11   FALSE  FALSE
# 4: 2017 51 2017-12-18   FALSE  FALSE
# 5: 2017 52 2017-12-25    TRUE   TRUE

Note that I aggregate the data by year and week, on the assumption that you want a separate summary for each week of each year.
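To illustrate why grouping by both year and week can matter, here is a small dplyr sketch on hypothetical data that spans two years. The 2018 row is invented for illustration and does not come from the question; the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data: week 52 appears in both 2017 and 2018.
df2 <- data.frame(
  year = c(2017L, 2017L, 2018L),
  date = as.Date(c("2017-12-25", "2017-12-31", "2018-12-24")),
  wk = c(52L, 52L, 52L),
  holiday = c(0L, 1L, 0L),
  closed_day = c(1L, 0L, 0L)
)

by_year_week <- df2 %>%
  group_by(year, wk) %>%                 # one group per week of each year
  summarise(date = first(date),
            holiday = any(holiday > 0),
            closed = any(closed_day > 0),
            .groups = "drop")            # return an ungrouped tibble
```

Grouping by wk alone would merge the two week-52 groups into one row and carry the 2017 flags over to 2018.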