Using lag in dplyr and capping values with the mean

Time: 2017-10-11 05:43:25

Tags: r

I have the following data frame in R:
   name    date         month    year     hours
   SSI     01-01-2016   01       2016      2000
   SSI     02-01-2016   01       2016      1900
   SSI     03-01-2016   01       2016      2038
   SSI     04-01-2016   01       2016      2041
   SSII    01-01-2016   01       2016      2000
   SSII    02-01-2016   01       2016      2100
   SSII    03-01-2016   01       2016      2105
   SSII    04-01-2016   01       2016      2203

I want to calculate the lag of hours, grouped by name, month, and year. I can do this with the following code:

  df1 <- df %>% 
    group_by(name,year,month) %>% 
    mutate(running_hrs = hours- lag(hours)) %>% 
    as.data.frame()
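
For anyone who wants to reproduce this, here is a minimal sketch of the setup. The values are transcribed from the table above; the column types (character `month`, numeric `year` and `hours`) are assumptions, since the question does not show them.

```r
library(dplyr)

# Sample data transcribed from the table in the question.
# Column types are an assumption, not something the question states.
df <- data.frame(
  name  = rep(c("SSI", "SSII"), each = 4),
  date  = rep(c("01-01-2016", "02-01-2016", "03-01-2016", "04-01-2016"), times = 2),
  month = "01",
  year  = 2016,
  hours = c(2000, 1900, 2038, 2041, 2000, 2100, 2105, 2203),
  stringsAsFactors = FALSE
)

df1 <- df %>%
  group_by(name, year, month) %>%
  # lag() has no previous row for the first row of each group, so it yields NA there
  mutate(running_hrs = hours - lag(hours)) %>%
  as.data.frame()
```

The first row of each group has no predecessor, so its `running_hrs` is `NA`; any later step that averages this column has to deal with that.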

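What I want is this: wherever running_hrs is greater than 24 or less than 0, I want to cap the value by replacing it with the mean for that month. This is what I am doing: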

  new_df <- df%>% 
    group_by(name,year,month) %>% 
    mutate(running_hrs = hours- lag(hours)) %>% 
    mutate(running_hrs_new = ifelse(running_hrs > 24 | running_hrs < 0,mean(running_hrs),running_hrs)) %>% 
    as.data.frame()

   name    date         month   year    hours   running_hrs running_hrs_new
   SSI     01-01-2016   01      2016    2000        NA         
   SSI     02-01-2016   01      2016    1900       -100            (3/4)
   SSI     03-01-2016   01      2016    2038        138            (3/4)
   SSI     04-01-2016   01      2016    2041        3                3   
   SSII    01-01-2016   01      2016    2000        NA           
   SSII    02-01-2016   01      2016    2100        100            (10/4) 
   SSII    03-01-2016   01      2016    2105        5                5   
   SSII    04-01-2016   01      2016    2110        5                5

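The out-of-range values should be replaced with the mean of the running hours that are less than 24 and greater than or equal to zero. I think a conditional mean could be used here.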

1 Answer:

Answer 0: (score: 1)

Hope this helps!

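library(dplyr)
library(tidyr)

new_df <- df %>%
  group_by(name, year, month) %>%
  # Difference from the previous row; NA for the first row of each group
  mutate(running_hrs = hours - lag(hours)) %>%
  # Keep in-range differences and count everything else as 0
  mutate(valid_running_hrs = ifelse(running_hrs < 24 & running_hrs > 0, running_hrs, 0)) %>%
  # The NA difference on the first row also counts as 0
  replace_na(list(valid_running_hrs = 0)) %>%
  # Re-apply the grouping before taking the group mean
  group_by(name, year, month) %>%
  # Replace out-of-range differences with the group mean of valid_running_hrs
  mutate(running_hrs_new = ifelse(running_hrs > 24 | running_hrs < 0, mean(valid_running_hrs), running_hrs)) %>%
  as.data.frame()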
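
Two things may be worth noting about the code above. First, in the question's attempt `mean(running_hrs)` returns `NA` for every group, because `lag()` leaves the first difference as `NA` and `mean()` does not drop missing values by default. Second, the pipeline above averages over all rows in the group, counting out-of-range and missing differences as 0, which is what produces the 3/4 and 10/4 in the expected output. If the replacement should instead be the mean of only the in-range differences, a conditional mean is one option. The following is a minimal sketch under that assumption, not the answer's method: `in_range` is a helper column introduced here for illustration, the bounds 0 and 24 are treated as inclusive, and the `hours` values are taken from the expected-output table in the question (whose last value is 2110, while the input table lists 2203).

```r
library(dplyr)

# hours taken from the expected-output table in the question (assumption)
df <- data.frame(
  name  = rep(c("SSI", "SSII"), each = 4),
  month = "01",
  year  = 2016,
  hours = c(2000, 1900, 2038, 2041, 2000, 2100, 2105, 2110),
  stringsAsFactors = FALSE
)

new_df <- df %>%
  group_by(name, year, month) %>%
  mutate(
    running_hrs = hours - lag(hours),
    # in_range is a hypothetical helper: TRUE only for non-missing values in [0, 24]
    in_range = !is.na(running_hrs) & running_hrs >= 0 & running_hrs <= 24,
    # Keep NA and in-range values; replace the rest with the mean of in-range values
    running_hrs_new = ifelse(
      is.na(running_hrs) | in_range,
      running_hrs,
      mean(running_hrs[in_range])
    )
  ) %>%
  ungroup() %>%
  as.data.frame()
```

With this data the replacement value is 3 for SSI and 5 for SSII, rather than 3/4 and 10/4. A group with no in-range differences would get `NaN` as its replacement, since the mean of an empty vector is `NaN`.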