Compare rows to see whether a customer has switched products

Time: 2019-12-14 18:41:05

Tags: r

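I have an Excel workflow that I am trying to convert into an R script, but I can't figure out how to translate the formulas into something R understands.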

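Given the small sample table below, I want to find out whether a given customer has switched from one product to another, and how many days lie between two consecutive recorded dates for that customer. In Excel this is easy to do. To detect a switch, I use =IF(AND(B2=B1,D2<>D1),1,0): if the Id is the same from one row to the next and the product changes, I get 1; otherwise I get 0. To count the days between dates, I use =IF(B2=B1,DATEDIF(A1,A2,"d"),0).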
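Read literally, both formulas compare each row with the row directly above it. A minimal base R sketch of that row-by-row logic, assuming three made-up rows and hypothetical vector names (date, id, product) standing in for columns A, B and D, might look like this:

```r
# Toy vectors (hypothetical), laid out like columns A, B and D of the sheet
date    <- as.Date(c("2019-01-01", "2019-01-03", "2019-01-24"))
id      <- c(1, 1, 3)
product <- c("Product A", "Product B", "Product A")

n    <- length(id)
prev <- c(NA, seq_len(n - 1))   # index of the row above; NA for the first row

# B2 = B1: same customer as the row above (FALSE for the first row)
same_id <- !is.na(prev) & id == id[prev]

# IF(AND(B2 = B1, D2 <> D1), 1, 0)
switched <- as.numeric(same_id & product != product[prev])

# IF(B2 = B1, DATEDIF(A1, A2, "d"), 0)
days <- ifelse(same_id, as.numeric(date - date[prev]), 0)
```

This version relies on the rows of each customer being adjacent, exactly as the Excel formulas do.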

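Ideally, I would like to count only the number of days a customer has been on a given product, with the last date counted up to today's date, but that may be too complicated.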

Is there a quick and easy way in R to get from the initial table to the final table?

Sample data:

Date <- c("1/1/2019", "1/3/2019", "1/4/2019", "1/20/2019", 
          "1/24/2019", "2/6/2019", "3/2/2019", "3/25/2019", "4/9/2019", 
          "4/24/2019", "5/1/2019", "5/6/2019", "5/13/2019", "5/15/2019", 
          "1/1/2019", "1/3/2019", "1/4/2019", "1/20/2019", "1/24/2019", 
          "2/6/2019", "3/2/2019", "3/25/2019", "4/9/2019")

Id <- c(1, 1, 1, 1, 3, 3, 3, 2, 4, 4, 4, 4, 4, 4, 5, 5,
        5, 5, 5, 5, 6, 7, 7)

Value <- c(991, 434, 741, 509, 421, 904, 728, 172, 341, 903,
           367, 378, 351, 906, 178, 649, 264, 935, 988, 694,
           334, 884, 545)

Product <- c("Product A", "Product B", "Product B", "Product C",
             "Product A", "Product A", "Product A","Product D",
             "Product A", "Product B", "Product C", "Product D",
             "Product C", "Product D", "Product A", "Product A",
             "Product A", "Product A", "Product A", "Product A",
             "Product B", "Product C", "Product D")

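# data_frame() is deprecated in tibble; tibble() is the current constructor
df <- tibble::tibble(Date, Id, Value, Product)

# Parse the month/day/year strings into Date values
df$Date <- lubridate::mdy(df$Date)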

# Initial table:
# A tibble: 23 x 4
   Date          Id Value Product  
   <date>     <dbl> <dbl> <chr>    
 1 2019-01-01     1   991 Product A
 2 2019-01-03     1   434 Product B
 3 2019-01-04     1   741 Product B
 4 2019-01-20     1   509 Product C
 5 2019-01-24     3   421 Product A
 6 2019-02-06     3   904 Product A
 7 2019-03-02     3   728 Product A
 8 2019-03-25     2   172 Product D
 9 2019-04-09     4   341 Product A
10 2019-04-24     4   903 Product B
11 2019-05-01     4   367 Product C
12 2019-05-06     4   378 Product D
13 2019-05-13     4   351 Product C
14 2019-05-15     4   906 Product D
15 2019-01-01     5   178 Product A
16 2019-01-03     5   649 Product A
17 2019-01-04     5   264 Product A
18 2019-01-20     5   935 Product A
19 2019-01-24     5   988 Product A
20 2019-02-06     5   694 Product A
21 2019-03-02     6   334 Product B
22 2019-03-25     7   884 Product C
23 2019-04-09     7   545 Product D

Final table:
# A tibble: 23 x 6
   Date          Id Value Product   Switched Days_between_dates
   <date>     <dbl> <dbl> <chr>        <dbl>              <dbl>
 1 2019-01-01     1   991 Product A        0                  0
 2 2019-01-03     1   434 Product B        1                  2
 3 2019-01-04     1   741 Product B        0                  1
 4 2019-01-20     1   509 Product C        1                 16
 5 2019-01-24     3   421 Product A        0                  0
 6 2019-02-06     3   904 Product A        0                 13
 7 2019-03-02     3   728 Product A        0                 24
 8 2019-03-25     2   172 Product D        0                  0
 9 2019-04-09     4   341 Product A        0                  0
10 2019-04-24     4   903 Product B        1                 15
11 2019-05-01     4   367 Product C        1                  7
12 2019-05-06     4   378 Product D        1                  5
13 2019-05-13     4   351 Product C        1                  7
14 2019-05-15     4   906 Product D        1                  2
15 2019-01-01     5   178 Product A        0                  0
16 2019-01-03     5   649 Product A        0                  2
17 2019-01-04     5   264 Product A        0                  1
18 2019-01-20     5   935 Product A        0                 16
19 2019-01-24     5   988 Product A        0                  4
20 2019-02-06     5   694 Product A        0                 13
21 2019-03-02     6   334 Product B        0                  0
22 2019-03-25     7   884 Product C        0                  0
23 2019-04-09     7   545 Product D        1                 15

2 Answers:

Answer 0 (score: 1)

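library(lubridate)
library(tidyverse)

df %>%
  group_by(Id) %>%
  # Compare each row with the previous row of the same Id; defaulting lag()
  # to the group's first value makes the first row of every group come out as 0
  mutate(Switched = as.numeric(Product != lag(Product, default = Product[1])), 
         Days_between_dates = as.numeric(Date - lag(Date, default = Date[1])))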

Output

# A tibble: 23 x 6
# Groups:   Id [7]
   Date          Id Value Product   Switched Days_between_dates
   <date>     <dbl> <dbl> <fct>        <dbl>              <dbl>
 1 2019-01-01     1   991 Product A        0                  0
 2 2019-01-03     1   434 Product B        1                  2
 3 2019-01-04     1   741 Product B        0                  1
 4 2019-01-20     1   509 Product C        1                 16
 5 2019-01-24     3   421 Product A        0                  0
 6 2019-02-06     3   904 Product A        0                 13
 7 2019-03-02     3   728 Product A        0                 24
 8 2019-03-25     2   172 Product D        0                  0
 9 2019-04-09     4   341 Product A        0                  0
10 2019-04-24     4   903 Product B        1                 15
11 2019-05-01     4   367 Product C        1                  7
12 2019-05-06     4   378 Product D        1                  5
13 2019-05-13     4   351 Product C        1                  7
14 2019-05-15     4   906 Product D        1                  2
15 2019-01-01     5   178 Product A        0                  0
16 2019-01-03     5   649 Product A        0                  2
17 2019-01-04     5   264 Product A        0                  1
18 2019-01-20     5   935 Product A        0                 16
19 2019-01-24     5   988 Product A        0                  4
20 2019-02-06     5   694 Product A        0                 13
21 2019-03-02     6   334 Product B        0                  0
22 2019-03-25     7   884 Product C        0                  0
23 2019-04-09     7   545 Product D        1                 15
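The question also asks, as a stretch goal, for the number of days a customer stayed on a given product, with the last product counted up to today. One possible reading of that is "one row per uninterrupted stint on a product". The sketch below follows that reading with dplyr (1.0.0 or later, for the .groups argument). The toy data, the column names Stint, Start and Days_on_product, and the fixed reference date are all assumptions made for illustration; in real use today would be Sys.Date().

```r
library(dplyr)

# Hypothetical toy data using the question's column names
toy <- tibble(
  Date    = as.Date(c("2019-01-01", "2019-01-03", "2019-01-04",
                      "2019-01-20", "2019-01-24")),
  Id      = c(1, 1, 1, 1, 3),
  Product = c("Product A", "Product B", "Product B", "Product C", "Product A")
)

# Assumption: a fixed reference date so the result is reproducible
today <- as.Date("2019-12-14")

stints <- toy %>%
  arrange(Id, Date) %>%
  group_by(Id) %>%
  # A new stint starts whenever the product differs from the previous row
  mutate(Stint = cumsum(Product != lag(Product, default = first(Product)))) %>%
  group_by(Id, Stint, Product) %>%
  summarise(Start = min(Date), .groups = "drop") %>%
  group_by(Id) %>%
  # A stint ends when the next one starts; the last one runs until `today`
  mutate(Days_on_product = as.numeric(lead(Start, default = today) - Start)) %>%
  ungroup()
```

With this toy data, customer 1 gets 2 days on Product A and 17 days on Product B, and the final stint of each customer is measured against the reference date.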

Answer 1 (score: 1)

Here is a base R solution:

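# Split by Id, add both columns to each piece, then stack the pieces again
df <- Reduce(rbind,lapply(split(df,df$Id), function(v) {
  # Strip the "Product " prefix and map the remaining letter to its position
  # in LETTERS; this assumes names of the form "Product <single capital letter>"
  v$Switched <- c(0,ifelse(diff(match(gsub(".*?\\s","",v$Product),LETTERS))!=0,1,0))
  # Days since the previous row of the same Id; 0 for the first row
  v$Days_between_dates <- c(0,diff(v$Date))
  v
}))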

which gives

> df
         Date Id Value   Product Switched Days_between_dates
1  2019-01-01  1   991 Product A        0                  0
2  2019-01-03  1   434 Product B        1                  2
3  2019-01-04  1   741 Product B        0                  1
4  2019-01-20  1   509 Product C        1                 16
8  2019-03-25  2   172 Product D        0                  0
5  2019-01-24  3   421 Product A        0                  0
6  2019-02-06  3   904 Product A        0                 13
7  2019-03-02  3   728 Product A        0                 24
9  2019-04-09  4   341 Product A        0                  0
10 2019-04-24  4   903 Product B        1                 15
11 2019-05-01  4   367 Product C        1                  7
12 2019-05-06  4   378 Product D        1                  5
13 2019-05-13  4   351 Product C        1                  7
14 2019-05-15  4   906 Product D        1                  2
15 2019-01-01  5   178 Product A        0                  0
16 2019-01-03  5   649 Product A        0                  2
17 2019-01-04  5   264 Product A        0                  1
18 2019-01-20  5   935 Product A        0                 16
19 2019-01-24  5   988 Product A        0                  4
20 2019-02-06  5   694 Product A        0                 13
21 2019-03-02  6   334 Product B        0                  0
22 2019-03-25  7   884 Product C        0                  0
23 2019-04-09  7   545 Product D        1                 15
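Note that split() orders the pieces by Id, so the rows above come back with Id 2 ahead of Id 3. If the original row order matters, a possible alternative is ave(), which applies a function within each group and writes the results back in place. This is a sketch on hypothetical toy data, not part of the answer above; the integer product codes are only a device to hand ave() a numeric vector.

```r
# Toy data frame (hypothetical), with Ids deliberately not in sorted order
toy <- data.frame(
  Date    = as.Date(c("2019-01-24", "2019-02-06", "2019-03-25",
                      "2019-04-09", "2019-04-24")),
  Id      = c(3, 3, 2, 4, 4),
  Product = c("Product A", "Product A", "Product D", "Product A", "Product B"),
  stringsAsFactors = FALSE
)

# Days since the previous row of the same Id; 0 for the first row of each Id
toy$Days_between_dates <- ave(as.numeric(toy$Date), toy$Id,
                              FUN = function(d) c(0, diff(d)))

# Encode products as integer codes, then flag a change of code within each Id
codes <- as.integer(factor(toy$Product))
toy$Switched <- ave(codes, toy$Id,
                    FUN = function(p) c(0L, as.integer(diff(p) != 0)))
```

Because ave() returns a vector aligned with its input, toy keeps its rows in the order 3, 3, 2, 4, 4 and no rbind step is needed.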

Data

> dput(df)
structure(list(Date = structure(c(17897, 17899, 17900, 17916, 
17920, 17933, 17957, 17980, 17995, 18010, 18017, 18022, 18029, 
18031, 17897, 17899, 17900, 17916, 17920, 17933, 17957, 17980, 
17995), class = "Date"), Id = c(1, 1, 1, 1, 3, 3, 3, 2, 4, 4, 
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7), Value = c(991, 434, 741, 
509, 421, 904, 728, 172, 341, 903, 367, 378, 351, 906, 178, 649, 
264, 935, 988, 694, 334, 884, 545), Product = structure(c(1L, 
2L, 2L, 3L, 1L, 1L, 1L, 4L, 1L, 2L, 3L, 4L, 3L, 4L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 3L, 4L), .Label = c("Product A", "Product B", 
"Product C", "Product D"), class = "factor")), class = "data.frame", row.names = c(NA, 
-23L))