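dplyr: Add a new row at the end of each group, computed from variables in the previous row

Time: 2019-01-27 15:21:01

Tags: r dplyr event-log

Key issue

I can fill the new row with values from the previous row, and I can assign constants to variables in the new row. What I cannot do is compute values from the preceding rows and assign them in the new row.

Background

I have real-world data from a PLC that I am preparing to convert into an event log for use with bupaR. The data below is truncated and simplified, but it contains the resource, timestamp, state type, and event_ID.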

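Already achieved

  • I added Error_ID, Error_startTS, Error_endTS, and part of the lifecycle, as described in another SO question.
  • An error is defined as a series of events that starts with State_type == "Error" and continues until an event occurs whose state is anything other than "Error", "Comlink Down", or "Not active".
  • An error number is assigned to all rows of the same "error trace" ("Error_ID").
  • The start time of the error (the timestamp of the first error row) is assigned ("Error_startTS").
  • The end time of the error, i.e. the timestamp of the first row after the error (the event that ends the error), is assigned ("Error_endTS").
  • A "Lifecycle_ID" of either "Start" or "Ongoing" is assigned to the rows of the error.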
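For readers who want to reproduce the starting point, the error-trace definition above could be sketched roughly as follows. This is a minimal sketch under assumptions, not the code from the linked question: the toy `events` tibble and the helper columns `member`, `run`, `in_trace`, and `starts` are made up for illustration.

```r
library(dplyr)

# Toy state sequence (illustrative only, not the PLC data).
events <- tibble(State_type = c("No part in", "Error", "Not active", "Error",
                                "Producing", "Not active", "Error", "Producing"))

# States that may occur inside an error trace.
trace_states <- c("Error", "Comlink Down", "Not active")

traced <- events %>%
  mutate(member = State_type %in% trace_states,
         run    = cumsum(!member)) %>%   # any other state starts a new run
  group_by(run) %>%
  # A row belongs to a trace once an "Error" has occurred within its run.
  mutate(in_trace = member & cumany(State_type == "Error")) %>%
  ungroup() %>%
  mutate(starts   = in_trace & !lag(in_trace, default = FALSE),
         Error_ID = if_else(in_trace, cumsum(starts), 0L)) %>%
  select(State_type, Error_ID)

print(traced)
```

The "Not active" row that directly follows "Producing" keeps Error_ID 0 here, because no "Error" precedes it within its run.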

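Goal:

Now I want to insert a new row

  • with Lifecycle_ID == "Complete"
  • after the last "Ongoing" row of each "error trace"

Details

  • Works with fill(): copy from the last row
    • "Resource"
    • "Error_ID"
    • "Error_startTS"
    • "Error_endTS"
  • Works with add_row(): assign a constant
    • "Lifecycle_ID" should be "Complete"
    • "State_type" should be "Error"
  • The problem for me: assigning values based on the values of preceding rows
    • The timestamp "Datetime_local" should get the value of "Error_endTS" in the group
    • "event_ID" should be incremented by 1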
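One way the requirements in this list might be met is to build the "Complete" rows separately from the last row of each error trace and then bind them back in. The following is only a sketch under assumptions: `events` is a four-row stand-in for my_df, `next_event_id()` is a hypothetical helper, and it assumes every event_ID is the letter "e" followed by a zero-padded number of the same width.

```r
library(dplyr)

ts <- function(x) as.POSIXct(x, tz = "UTC")

# Small stand-in for my_df: one error trace (Error_ID 1) between two normal rows.
events <- tibble(
  Resource       = "L60",
  Datetime_local = ts(c("2018-09-03 07:29:54", "2018-09-03 07:30:18",
                        "2018-09-03 07:33:55", "2018-09-03 07:34:00")),
  State_type     = c("No part in", "Error", "Error", "Producing"),
  event_ID       = c("e00000000000072160", "e00000000000072270",
                     "e00000000000073110", "e00000000000073150"),
  Error_ID       = c(0, 1, 1, 0),
  Error_startTS  = ts(c(NA, "2018-09-03 07:30:18", "2018-09-03 07:30:18", NA)),
  Error_endTS    = ts(c(NA, "2018-09-03 07:34:00", "2018-09-03 07:34:00", NA)),
  Lifecycle_ID   = c(NA, "Start", "Ongoing", NA)
)

# Hypothetical helper: add 1 to the numeric part and restore the zero padding.
next_event_id <- function(id) {
  digits <- sub("^e", "", id)
  paste0("e", formatC(as.numeric(digits) + 1, width = max(nchar(digits)),
                      format = "d", flag = "0"))
}

complete_rows <- events %>%
  filter(Error_ID != 0) %>%
  group_by(Error_ID) %>%
  slice(n()) %>%                      # last row of each error trace
  ungroup() %>%
  mutate(Datetime_local = Error_endTS,              # computed from the group
         event_ID       = next_event_id(event_ID),  # computed from the last row
         State_type     = "Error",                  # constants
         Lifecycle_ID   = "Complete")

result <- bind_rows(events, complete_rows) %>%
  arrange(event_ID)

print(result)
```

Because the copied row already carries Resource, Error_ID, Error_startTS, and Error_endTS, no fill() step is needed in this variant. Sorting by event_ID relies on the IDs having a fixed width.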

Data

my_df <- structure(
  list(Resource = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
                            .Label = c("L54", "L60", "L66", "L68", "L70", "L76", 
                                       "L78", "L95", "L96", "L97", "L98", "L99"), 
                            class = "factor"), 
       Datetime_local = structure(c(1535952594, 1535952618, 1535952643, 1535952651, 
                                    1535952787, 1535952835, 1535952840, 1535952846, 
                                    1535952890, 1535952949, 1535952952, 1535952958, 
                                    1535953066), 
                                  class = c("POSIXct", "POSIXt"), tzone = ""), 
       State_type = structure(c(6L, 4L, 8L, 4L, 8L, 4L, 12L, 4L, 8L, 4L, 12L, 4L, 12L), 
                              .Label = c("Comlink Down", "Comlink Up", "Counter", "Error", 
                                         "Message", "No part in", "No part out", "Not active", 
                                         "Part changing", "Part in", "Part out", "Producing", 
                                         "Waiting"), 
                              class = "factor"), 
       event_ID = c("e00000000000072160", "e00000000000072270", "e00000000000072400", 
                    "e00000000000072430", "e00000000000072810", "e00000000000073110", 
                    "e00000000000073150", "e00000000000073170", "e00000000000073300", 
                    "e00000000000073520", "e00000000000073540", "e00000000000073570", 
                    "e00000000000074040"), 
       Error_ID = c(0, 1, 1, 1, 1, 1, 0, 2, 2, 2, 0, 3, 0), 
       Error_startTS = structure(c(NA, 1535952618, 1535952618, 1535952618, 1535952618, 
                                   1535952618, NA, 1535952846, 1535952846, 1535952846, 
                                   NA, 1535952958, NA), 
                                 class = c("POSIXct", "POSIXt"), tzone = ""), 
       Error_endTS = structure(c(NA, 1535952840, 1535952840, 1535952840, 1535952840, 
                                 1535952840, NA, 1535952952, 1535952952, 1535952952, 
                                 NA, 1535953066, NA), 
                               class = c("POSIXct", "POSIXt"), tzone = ""), 
       Lifecycle_ID = c(NA, "Start", "Ongoing", "Ongoing", "Ongoing", "Ongoing", NA, 
                        "Start", "Ongoing", "Ongoing", NA, "Start", NA)), 
  .Names = c("Resource", "Datetime_local", "State_type", "event_ID", "Error_ID", 
            "Error_startTS", "Error_endTS", "Lifecycle_ID"), 
  row.names = 160:172, class = "data.frame")

...which looks like this

# Resource      Datetime_local State_type           event_ID Error_ID       Error_startTS         Error_endTS Lifecycle_ID
160      L60 2018-09-03 07:29:54 No part in e00000000000072160        0                <NA>                <NA>         <NA>
161      L60 2018-09-03 07:30:18      Error e00000000000072270        1 2018-09-03 07:30:18 2018-09-03 07:34:00        Start
162      L60 2018-09-03 07:30:43 Not active e00000000000072400        1 2018-09-03 07:30:18 2018-09-03 07:34:00      Ongoing
163      L60 2018-09-03 07:30:51      Error e00000000000072430        1 2018-09-03 07:30:18 2018-09-03 07:34:00      Ongoing
164      L60 2018-09-03 07:33:07 Not active e00000000000072810        1 2018-09-03 07:30:18 2018-09-03 07:34:00      Ongoing
165      L60 2018-09-03 07:33:55      Error e00000000000073110        1 2018-09-03 07:30:18 2018-09-03 07:34:00      Ongoing
166      L60 2018-09-03 07:34:00  Producing e00000000000073150        0                <NA>                <NA>         <NA>
167      L60 2018-09-03 07:34:06      Error e00000000000073170        2 2018-09-03 07:34:06 2018-09-03 07:35:52        Start
168      L60 2018-09-03 07:34:50 Not active e00000000000073300        2 2018-09-03 07:34:06 2018-09-03 07:35:52      Ongoing
169      L60 2018-09-03 07:35:49      Error e00000000000073520        2 2018-09-03 07:34:06 2018-09-03 07:35:52      Ongoing
170      L60 2018-09-03 07:35:52  Producing e00000000000073540        0                <NA>                <NA>         <NA>
171      L60 2018-09-03 07:35:58      Error e00000000000073570        3 2018-09-03 07:35:58 2018-09-03 07:37:46        Start
172      L60 2018-09-03 07:37:46  Producing e00000000000074040        0                <NA>                <NA>         <NA>

UDF

library(dplyr)
library(tidyr)   # provides fill()

ErrorNumberAddLastRow <- function(df){
  df %>%
    mutate_if(is.factor, as.character) %>%
    group_by(Error_ID) %>%
    do(add_row(.,
               Lifecycle_ID = "Complete", State_type = "Error")) %>%
    ungroup() %>%
    fill("Resource", "event_ID","Error_ID", "Error_startTS", "Error_endTS") %>%
    # mutate(event_ID = event_ID+1) %>%          # error: non-numeric argument to binary operator.
    # mutate(Datetime_local = Error_endTS) %>%   # assigns the same TS to the whole group
    arrange(event_ID) %>% 
    filter( !(Error_ID==0 & Lifecycle_ID=="Complete") | is.na(Lifecycle_ID))
}

Calling the UDF

ErrorNumberAddLastRow(my_df)

gives this result

# A tibble: 16 x 8
   Resource Datetime_local      State_type event_ID           Error_ID Error_startTS       Error_endTS         Lifecycle_ID
   <chr>    <dttm>              <chr>      <chr>                 <dbl> <dttm>              <dttm>              <chr>       
 1 L60      2018-09-03 07:29:54 No part in e00000000000072160        0 NA                  NA                  NA          
 2 L60      2018-09-03 07:30:18 Error      e00000000000072270        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Start       
 3 L60      2018-09-03 07:30:43 Not active e00000000000072400        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
 4 L60      2018-09-03 07:30:51 Error      e00000000000072430        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
 5 L60      2018-09-03 07:33:07 Not active e00000000000072810        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
 6 L60      2018-09-03 07:33:55 Error      e00000000000073110        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
 7 L60      NA                  Error      e00000000000073110        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Complete    
 8 L60      2018-09-03 07:34:00 Producing  e00000000000073150        0 NA                  NA                  NA          
 9 L60      2018-09-03 07:34:06 Error      e00000000000073170        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Start       
10 L60      2018-09-03 07:34:50 Not active e00000000000073300        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Ongoing     
11 L60      2018-09-03 07:35:49 Error      e00000000000073520        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Ongoing     
12 L60      NA                  Error      e00000000000073520        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Complete    
13 L60      2018-09-03 07:35:52 Producing  e00000000000073540        0 NA                  NA                  NA          
14 L60      2018-09-03 07:35:58 Error      e00000000000073570        3 2018-09-03 07:35:58 2018-09-03 07:37:46 Start       
15 L60      NA                  Error      e00000000000073570        3 2018-09-03 07:35:58 2018-09-03 07:37:46 Complete    
16 L60      2018-09-03 07:37:46 Producing  e00000000000074040        0 NA                  NA                  NA      

Desired result

# # A tibble: 16 x 8
# Resource Datetime_local      State_type event_ID           Error_ID Error_startTS       Error_endTS         Lifecycle_ID
# <chr>    <dttm>              <chr>      <chr>                 <dbl> <dttm>              <dttm>              <chr>       
#  1 L60      2018-09-03 07:29:54 No part in e00000000000072160        0 NA                  NA                  NA          
#  2 L60      2018-09-03 07:30:18 Error      e00000000000072270        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Start       
#  3 L60      2018-09-03 07:30:43 Not active e00000000000072400        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
#  4 L60      2018-09-03 07:30:51 Error      e00000000000072430        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
#  5 L60      2018-09-03 07:33:07 Not active e00000000000072810        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
#  6 L60      2018-09-03 07:33:55 Error      e00000000000073110        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Ongoing     
#  7 L60      2018-09-03 07:34:00 Error      e00000000000073111        1 2018-09-03 07:30:18 2018-09-03 07:34:00 Complete    
#  8 L60      2018-09-03 07:34:00 Producing  e00000000000073150        0 NA                  NA                  NA          
#  9 L60      2018-09-03 07:34:06 Error      e00000000000073170        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Start       
# 10 L60      2018-09-03 07:34:50 Not active e00000000000073300        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Ongoing     
# 11 L60      2018-09-03 07:35:49 Error      e00000000000073520        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Ongoing     
# 12 L60      2018-09-03 07:35:52 Error      e00000000000073521        2 2018-09-03 07:34:06 2018-09-03 07:35:52 Complete    
# 13 L60      2018-09-03 07:35:52 Producing  e00000000000073540        0 NA                  NA                  NA          
# 14 L60      2018-09-03 07:35:58 Error      e00000000000073570        3 2018-09-03 07:35:58 2018-09-03 07:37:46 Start       
# 15 L60      2018-09-03 07:37:46 Error      e00000000000073571        3 2018-09-03 07:35:58 2018-09-03 07:37:46 Complete    
# 16 L60      2018-09-03 07:37:46 Producing  e00000000000074040        0 NA                  NA                  NA   

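Details

In rows 7, 12, and 15

  • increment event_ID by 1
  • set the Datetime_local timestamp to the "Error_endTS" of the group

Uncommenting the mutate statements in the function

  1. mutate(event_ID = event_ID+1) %>%

...throws an error

Error in mutate_impl(.data, dots) : Evaluation error: non-numeric argument to binary operator.

  2. mutate(Datetime_local = Error_endTS) %>%

...this assigns the same TS to the whole group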
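Both symptoms have a plain cause: event_ID is a character column, so `+ 1` has no numeric operand, and an unconditional mutate() overwrites every row rather than only the new one. A possible fix, sketched under assumptions, is to restrict both assignments to the "Complete" rows with if_else(). The `filled` tibble below is a hypothetical stand-in for the data after add_row() and fill(), and the code assumes every event_ID is "e" followed by a zero-padded number of the same width.

```r
library(dplyr)

ts <- function(x) as.POSIXct(x, tz = "UTC")

# Stand-in for one error trace after add_row() + fill(): the last row is the
# new "Complete" row, with event_ID filled down and Datetime_local still NA.
filled <- tibble(
  Datetime_local = ts(c("2018-09-03 07:30:18", "2018-09-03 07:33:55", NA)),
  event_ID       = c("e00000000000072270", "e00000000000073110",
                     "e00000000000073110"),
  Error_endTS    = ts(rep("2018-09-03 07:34:00", 3)),
  Lifecycle_ID   = c("Start", "Ongoing", "Complete")
)

fixed <- filled %>%
  mutate(
    # %in% returns FALSE (not NA) for rows where Lifecycle_ID is NA
    is_new         = Lifecycle_ID %in% "Complete",
    Datetime_local = if_else(is_new, Error_endTS, Datetime_local),
    # strip the "e", add 1 only on the new rows, then restore the padding
    id_number      = as.numeric(sub("^e", "", event_ID)) + is_new,
    event_ID       = paste0("e", formatC(id_number,
                                         width = max(nchar(event_ID)) - 1,
                                         format = "d", flag = "0"))
  ) %>%
  select(-is_new, -id_number)

print(fixed)
```

In the original function, these two assignments would go where the commented-out mutate() lines are, after fill() and before arrange().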

Thank you for any help you can give me.

1 Answer:

Answer 0 (score: 2)

Here is an idea

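filenames = [
    'me.jpg',
    'me.txt',
    'friend1.jpg',
    'friend2.bmp',
    'you.jpeg',
    'you.xml']

class ImageFileAcceptor:
    """Hypothetical stand-in: the original class is not shown in the answer."""
    extensions = ('.jpg', '.jpeg', '.bmp')  # assumed set of image extensions

    def __call__(self, filename):
        # Accept a filename when it ends with one of the image extensions.
        return filename.lower().endswith(self.extensions)

acceptor = ImageFileAcceptor()
image_filenames = list(filter(acceptor, filenames))  # filter() is lazy in Python 3
print(image_filenames)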