Expanding data by group under multiple conditions

Time: 2018-08-21 16:30:34

Tags: r dplyr purrr

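I have data on Jenkins job pipeline executions, and I am trying to determine the average time it takes to go from development to production, based on the start and end times in the data. The data resembles a transactional database: an execution of a pipeline to development is one unique record, and an execution of the same pipeline to production is another unique record (the two share only the grouping variable, which is the team that ran the job).

Here is a sample of the data I started with: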

  job_id   startTime            endTime               env_type  Team_ID
1  100      8/4/2017 17:14:00   8/4/2017 17:16:00      DEV       A
2  101      8/4/2017 17:20:00   8/4/2017 17:21:00      DEV       A
3  102      8/4/2017 17:24:00   8/4/2017 17:27:00      DEV       B
4  103      8/4/2017 17:38:00   8/4/2017 17:40:00      DEV       B
5  104      8/4/2017 17:40:00   8/4/2017 17:42:00      DEV       C
6  105      8/4/2017 17:51:00   8/4/2017 17:54:00      DEV       C

In my first attempt at expanding the data, I used mutate to create new columns, copying the start and end times according to env_type:

df %>%
    mutate(prod_job_id = ifelse(env_type == "PROD", job_id, ""), 
           prod_start_time = ifelse(env_type == "PROD", startTime, ""), 
           prod_end_time = ifelse(env_type == "PROD", endTime, ""),  
           dev_job_id = ifelse(env_type == "DEV", job_id, ""), 
           dev_start_time = ifelse(env_type == "DEV", startTime, ""), 
           dev_end_time = ifelse(env_type == "DEV", endTime, ""))

That got me to something like this (I also converted the times with as.POSIXct):

Team_ID env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time
1        A      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
2        A      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
3        A     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
4        A     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
5        B      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
6        B      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
7        B     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
8        C      DEV 2018-08-05 12:00:00 2018-08-05 13:00:00                <NA>                <NA>
9        C      DEV 2018-08-06 12:00:00 2018-08-06 13:00:00                <NA>                <NA>
10       C     TEST 2018-08-06 14:00:00 2018-08-06 15:00:00                <NA>                <NA>

Here is the dput:

structure(list(Team_ID = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"
), class = "factor"), pipeline_id = c(1000L, 1000L, 1000L, 1000L, 
2000L, 2000L, 2000L, 3000L, 3000L, 3000L, 4000L, 4000L, 5000L, 
5000L), env_type = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
1L, 3L, 1L, 1L, 2L, 2L), .Label = c("DEV", "PROD", "TEST"), class = "factor"), 
    dev_start_time = structure(c(1533142800, 1533229200, NA, 
    NA, 1533142800, 1533229200, NA, 1533488400, 1533574800, 1533582000, 
    1533142800, 1533229200, NA, NA), class = c("POSIXct", "POSIXt"
    ), tzone = ""), dev_end_time = structure(c(1533146400, 1533232800, 
    NA, NA, 1533146400, 1533232800, NA, 1533492000, 1533578400, 
    1533585600, 1533146400, 1533232800, NA, NA), class = c("POSIXct", 
    "POSIXt"), tzone = ""), prod_start_time = structure(c(NA, 
    NA, 1533236400, 1533243600, NA, NA, 1533243600, NA, NA, NA, 
    NA, NA, 1533236400, 1533243600), class = c("POSIXct", "POSIXt"
    ), tzone = ""), prod_end_time = structure(c(NA, NA, 1533240000, 
    1533247200, NA, NA, 1533247200, NA, NA, NA, NA, NA, 1533240000, 
    1533247200), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA, 
-14L))

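The tricky part is that a pipeline may go to development several times before it goes to production, and may even go to production again afterwards without going back to development, as seen in the data frame above.

I am trying to figure out how to create a loop (or a chain of dplyr / purrr commands, or some *ply function) that aligns the data so that I can use difftime to get the deployment duration. The end goal is to get the diffTime for every pipeline that went from dev to prod and then take the average. To get there, I have been trying to shape the data like this (env_type will no longer be valid after the manipulation, but that is fine, since in the end I only care about diffTime):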

Team_ID env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time diffTime
1       A     PROD 2018-08-01 12:00:00 2018-08-01 13:00:00 2018-08-02 14:00:00 2018-08-02 15:00:00  2678400
2       B     PROD 2018-08-02 12:00:00 2018-08-02 13:00:00 2018-08-02 16:00:00 2018-08-02 17:00:00    18000

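In plain English, what I think I need is:

For each row where env_type == "PROD", find the closest Dev timestamp and overwrite the Dev columns with that value, something like max(dev_end_time where dev_end_time is not greater than prod_start_time and dev_end_time is greater than the previous value of prod_end_time). I know the data needs to be grouped by Team_ID and put in order. I also know that I have to look at the prod pipelines first and then work backwards.

I have started with: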
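One literal reading of this rule can be sketched as follows. This is only an illustration under assumptions: the data is in long format with the startTime, endTime, env_type and Team_ID columns from the first sample, the rows are those of teams A and B from the example, and match_team() is a hypothetical helper name, not an existing function.

```r
library(dplyr)

# Illustrative long-format input: teams A and B from the example data.
jobs <- data.frame(
  Team_ID   = c("A", "A", "A", "A", "B", "B", "B"),
  env_type  = c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "PROD"),
  startTime = as.POSIXct(c("2018-08-01 12:00:00", "2018-08-02 12:00:00",
                           "2018-08-02 14:00:00", "2018-08-02 16:00:00",
                           "2018-08-01 12:00:00", "2018-08-02 12:00:00",
                           "2018-08-02 16:00:00"), tz = "UTC"),
  endTime   = as.POSIXct(c("2018-08-01 13:00:00", "2018-08-02 13:00:00",
                           "2018-08-02 15:00:00", "2018-08-02 17:00:00",
                           "2018-08-01 13:00:00", "2018-08-02 13:00:00",
                           "2018-08-02 17:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one team's rows, pair each PROD run with the DEV run
# whose end time is the latest one that is <= the PROD start and > the previous
# PROD end. PROD runs with no such DEV run are skipped.
match_team <- function(team_rows) {
  dev  <- team_rows[team_rows$env_type == "DEV", ]
  prod <- team_rows[team_rows$env_type == "PROD", ]
  prod <- prod[order(prod$startTime), ]
  prev_prod_end <- lag(prod$endTime)  # NA for the team's first PROD run
  pairs <- lapply(seq_len(nrow(prod)), function(i) {
    ok <- dev$endTime <= prod$startTime[i] &
      (is.na(prev_prod_end[i]) | dev$endTime > prev_prod_end[i])
    if (!any(ok)) return(NULL)
    cand <- which(ok)
    best <- cand[which.max(as.numeric(dev$endTime[cand]))]
    data.frame(dev_start_time  = dev$startTime[best],
               dev_end_time    = dev$endTime[best],
               prod_start_time = prod$startTime[i],
               prod_end_time   = prod$endTime[i])
  })
  bind_rows(pairs)  # NULL entries are dropped
}

# Apply the helper per team and measure dev start to prod end in seconds.
matched <- bind_rows(lapply(split(jobs, jobs$Team_ID), match_team),
                     .id = "Team_ID") %>%
  mutate(diffTime = as.numeric(difftime(prod_end_time, dev_start_time,
                                        units = "secs")))

mean(matched$diffTime)
```

With this sample, team A's second PROD run finds no DEV run after the first PROD run ended, so it is left out, and one pair per team remains. Measuring from dev_start_time to prod_end_time is a choice made here for illustration; swap in other columns if a different interval is wanted.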

df %>% 
    group_by(Team_ID) %>% 
    arrange(Team_ID, startTime) 

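to group the data and arrange it chronologically. But where should I go from here? I first thought mutate might work: mutate(dev_start_time = ifelse((dev_end_time < prod_start_time) & (dev_end_time > prod_start_time -1)), dev_start_time, ""), but I do not know how to make R look at the right row (prod_start_time -1 is meant to be the previous prod row, not the time minus 1).

I know there must be some way to do this, but I am just not familiar with the functions that would accomplish it.

Edit:

For @LetEpsilonBeLessThanZero: I tried the suggestion of grouping by pipeline_id, but grouping that way and then filtering to pipelines that have at least one dev row and one prod row would throw away valuable data. To illustrate, consider the following data: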

Team_ID pipeline_id env_type      dev_start_time        dev_end_time     prod_start_time       prod_end_time
1        A        1000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
2        A        1000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
3        A        1000     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
4        A        1000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
5        B        2000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
6        B        2000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
7        B        2000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00
8        C        3000      DEV 2018-08-05 12:00:00 2018-08-05 13:00:00                <NA>                <NA>
9        C        3000      DEV 2018-08-06 12:00:00 2018-08-06 13:00:00                <NA>                <NA>
10       C        3000     TEST 2018-08-06 14:00:00 2018-08-06 15:00:00                <NA>                <NA>
11       D        4000      DEV 2018-08-01 12:00:00 2018-08-01 13:00:00                <NA>                <NA>
12       D        4000      DEV 2018-08-02 12:00:00 2018-08-02 13:00:00                <NA>                <NA>
13       D        5000     PROD                <NA>                <NA> 2018-08-02 14:00:00 2018-08-02 15:00:00
14       D        5000     PROD                <NA>                <NA> 2018-08-02 16:00:00 2018-08-02 17:00:00

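Notice how Team D created one distinct dev pipeline and one distinct prod pipeline. I still need a way to link them and measure the time difference, because I know the deployments are for the same application, but that cannot be done with the suggested grouping by pipeline_id.

As an aside, I know we need a new way of grouping these teams so that the jobs are easier to correlate, and there are now plans to make that happen. But I still have to find a way to get the best data I can from what I have today, so all help is appreciated.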

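2 answers:

Answer 0 (score: 0)

How about the code below? I modified one of your dummy datasets so that I could test a few different cases.

The df data frame is the dummy dataset as entered, before any transformation.

df_w_implied_proj_id shows how to determine "proj_id", a field I created. proj_id is meant to represent the "true" pipeline.

mean_dev_df computes the mean total diffTime across the proj_ids.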

library(dplyr)

df = data.frame(startTime = as.POSIXct(c("2018-08-01 12:00:00",
                                         "2018-08-02 10:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-05 12:00:00",
                                         "2018-08-06 12:00:00",
                                         "2018-08-06 14:00:00",
                                         "2018-08-06 16:00:00",
                                         "2018-08-06 18:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00"), format="%Y-%m-%d %H:%M:%S"),
                endTime = as.POSIXct(c("2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-05 13:00:00",
                                       "2018-08-06 13:00:00",
                                       "2018-08-06 15:00:00",
                                       "2018-08-06 17:00:00",
                                       "2018-08-06 19:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 21:00:00"), format="%Y-%m-%d %H:%M:%S"),
                env_type = c("DEV","DEV","PROD","PROD","DEV","DEV","PROD","DEV","DEV","PROD","DEV","PROD","DEV","DEV","PROD","PROD"),
                Team_ID = c("A","A","A","A","B","B","B","C","C","C","C","C","D","D","D","D"))

df_w_implied_proj_id = df %>%
  arrange(Team_ID, startTime) %>%
  mutate(diffTimeSecs = difftime(endTime,startTime,units="secs"),
         proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1) %>%
  group_by(proj_id) %>%
  mutate(total_proj_diffTimeSecs = sum(diffTimeSecs))

mean_dev_df = df_w_implied_proj_id %>%
  group_by(proj_id) %>%
  summarise(temp_totals = sum(diffTimeSecs)) %>%
  ungroup() %>%
  summarise(mean_total_proj_diffTimeSecs = mean(temp_totals))

The workhorse of this code is this line:

proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1

To understand it, let's look at the env_type values in the dataset:

env_type
DEV
DEV
PROD
PROD
DEV
DEV
PROD
DEV
DEV
PROD
DEV
PROD
DEV
DEV
PROD
PROD

The lag function simply returns the value from the previous row. So, as a random example, lag(c("A","B","C"),default="BALLOON") returns c("BALLOON","A","B").

So env_type != lag(env_type, default = first(env_type)) returns the following:
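As a small sketch of that behaviour, the snippet below runs the example above and then applies lag inside group_by, so that the "previous row" never comes from a different team. The runs data frame and the prev_env column are made up for illustration.

```r
library(dplyr)

# Plain vector: the first element has no predecessor, so it gets the default.
shifted <- lag(c("A", "B", "C"), default = "BALLOON")

# Illustrative data: two teams, one DEV run followed by one PROD run each.
runs <- data.frame(
  Team_ID  = c("A", "A", "B", "B"),
  env_type = c("DEV", "PROD", "DEV", "PROD"),
  stringsAsFactors = FALSE
)

# Grouped lag: each team's first row gets NA instead of the last value
# of the team above it.
runs <- runs %>%
  group_by(Team_ID) %>%
  mutate(prev_env = lag(env_type)) %>%
  ungroup()
```

Without the group_by step, the first row of team B would see team A's last env_type as its previous value.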

env_type != lag(env_type, default = first(env_type))
0 (note: there's no row before the first row, so the lag statement defaults this to the first element of env_type vector, which is "DEV". And "DEV" != "DEV" evaluates to FALSE aka 0)
0 (note: "DEV" != "DEV" evaluates to FALSE aka 0)
1 (note: "PROD" != "DEV" evaluates to TRUE aka 1)
0 (note: "PROD" != "PROD" evaluates to FALSE aka 0. By now you hopefully get the gist of what's going on.)
1
0
1
1
0
1
1
1
1
0
1
0

Then cumsum(...) of that vector of 0s and 1s gives:

0 0 1 1 2 2 3 4 4 5 6 7 8 8 9 9

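Each increase of 1 marks a switch from "DEV" to "PROD" or vice versa.

We can then squash each even number together with its odd successor by integer-dividing each number by 2 (%/% 2) and then adding 1, which gives: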

1 1 1 1 2 2 2 3 3 3 4 4 5 5 5 5

These are our final proj_ids.

Answer 1 (score: 0)
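The three steps can be reproduced on the env_type vector alone. The intermediate names switched and switch_count are introduced here only for illustration.

```r
library(dplyr)

env_type <- c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "PROD", "DEV",
              "DEV", "PROD", "DEV", "PROD", "DEV", "DEV", "PROD", "PROD")

# Step 1: TRUE wherever the environment differs from the previous row.
switched <- env_type != lag(env_type, default = first(env_type))

# Step 2: running count of switches.
switch_count <- cumsum(switched)

# Step 3: integer-divide by 2 so each DEV block and the PROD block that
# follows it share one id, then shift so ids start at 1.
proj_id <- switch_count %/% 2 + 1

print(switch_count)
print(proj_id)
```

Because the lag here is not grouped by Team_ID, the ids only line up with teams when each team's block starts with DEV and ends with PROD, as it does in this dummy dataset. If one team ended on a DEV row and the next began with DEV, no switch would be counted between them and their rows would share a proj_id.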

Credit for the answer really goes to LetEpsilonBeLessThanZero, because he gave me some guidance on dplyr::lag(). But I have tested the following solution, and it does exactly what I need.

df %>% 
    group_by(Team_ID) %>% 
    arrange(Team_ID, startTime) %>% 
    mutate("Dev-Prod" = as.numeric(difftime(prod_end_time, lag(dev_start_time), units = "secs"))) %>%
    filter(!is.na(`Dev-Prod`))
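For anyone who wants to run this end to end, here is a self-contained sketch of the same lag-based idea. It assumes a data frame that carries both the long columns (startTime, endTime) and the wide columns (dev_start_time, prod_end_time). The runs data frame and the dev_prod_secs column name are made up for illustration, using teams A and C from the example.

```r
library(dplyr)

# Illustrative input: team A (DEV, DEV, PROD, PROD) and team C (DEV, DEV, TEST).
runs <- data.frame(
  Team_ID   = c("A", "A", "A", "A", "C", "C", "C"),
  env_type  = c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "TEST"),
  startTime = as.POSIXct(c("2018-08-01 12:00:00", "2018-08-02 12:00:00",
                           "2018-08-02 14:00:00", "2018-08-02 16:00:00",
                           "2018-08-05 12:00:00", "2018-08-06 12:00:00",
                           "2018-08-06 14:00:00"), tz = "UTC"),
  endTime   = as.POSIXct(c("2018-08-01 13:00:00", "2018-08-02 13:00:00",
                           "2018-08-02 15:00:00", "2018-08-02 17:00:00",
                           "2018-08-05 13:00:00", "2018-08-06 13:00:00",
                           "2018-08-06 15:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Wide columns: keep the time only on rows of the matching environment.
runs$dev_start_time <- replace(runs$startTime, runs$env_type != "DEV", NA)
runs$prod_end_time  <- replace(runs$endTime, runs$env_type != "PROD", NA)

dev_to_prod <- runs %>%
  arrange(Team_ID, startTime) %>%
  group_by(Team_ID) %>%
  # Seconds from the previous row's dev start to this row's prod end;
  # NA unless a PROD row directly follows a DEV row within the team.
  mutate(dev_prod_secs = as.numeric(difftime(prod_end_time,
                                             lag(dev_start_time),
                                             units = "secs"))) %>%
  ungroup() %>%
  filter(!is.na(dev_prod_secs))

mean(dev_to_prod$dev_prod_secs)
```

In this sample only team A's first PROD run survives the filter: its second PROD run follows another PROD run, and team C never reaches PROD.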