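How to create user paths from clickstream data

Time: 2019-01-10 15:54:38

Tags: r clickstream

I have some clickstream data on which I want to run an attribution analysis in a particular way, but first I need to get it into a specific format that covers both converting and non-converting users.

Representative data: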

df <- structure(list(
  User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 2001, 2002, 2002, 2003,
              2003, 2001, 2002, 2002, 2001),
  Session_ID = c("1001", "1002", "1003", "1004", "1005", "1006", "1007",
                 "Not Set", "Not Set", "Not Set", "Not Set", "Not Set",
                 "1008", "1009", "Not Set"),
  Date_time = structure(c(1540103940, 1540104060, 1540104240, 1540318080,
                          1540318680, 1540318859, 1540314360, 1540413060,
                          1540413240, 1540538460, 1540538640, 1540629660,
                          1540755060, 1540755240, 1540803000),
                        class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  Source = c("Facebook", "Facebook", "Facebook", "Google", "Email", "Google",
             "Email", "Referral", "Referral", "Facebook", "Facebook",
             "Google", "Referral", "Direct", "Direct"),
  Conversion = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)),
  class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -15L),
  spec = structure(list(
    cols = list(
      User_ID = structure(list(), class = c("collector_double", "collector")),
      Session_ID = structure(list(), class = c("collector_character", "collector")),
      Date_time = structure(list(format = ""), class = c("collector_datetime", "collector")),
      Source = structure(list(), class = c("collector_character", "collector")),
      Conversion = structure(list(), class = c("collector_double", "collector"))),
    default = structure(list(), class = c("collector_guess", "collector")),
    skip = 1), class = "col_spec"))

Then set the column classes:

library(dplyr)

df <- df %>% 
  mutate(User_ID    = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time  = as.POSIXct(Date_time)
         )

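I want to get every path of user visits that leads to a purchase, or the total path for users whose visits do not lead to a purchase.

The new column path would have a format such as Facebook > Facebook > Facebook > Email > Email for user 2001, which I know how to build using mutate(path = paste0(Source, collapse = " > "))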
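
For reference, a minimal sketch of that paste0(..., collapse = " > ") step on its own. The toy tibble and its values are made up for illustration and are not the dataset above; it only shows how grouping by user before collapsing gives one path string per user.

```r
library(dplyr)

# Toy data (hypothetical values, not the dataset from the question)
toy <- tibble(
  User_ID = c(2001, 2001, 2002, 2001),
  Source  = c("Facebook", "Email", "Google", "Direct")
)

# One row per user, with that user's sources joined in row order
paths <- toy %>%
  group_by(User_ID) %>%
  summarise(path = paste0(Source, collapse = " > "))

paths
```

Because the sources are joined in row order, the rows need to be sorted the way the journey should read before collapsing.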

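The complications are:

  • Most session IDs are "Not Set", meaning they are missing
  • Some users may convert more than once
  • Some users may convert and then return without converting again

Each row would be either:

  • a conversion by a user ID. Most converting users convert only once, but some may convert several times, in which case there is one row per conversion. The path column reflects the journey leading to that conversion; for a user's second or subsequent conversion, it shows only the path since the previous conversion.
  • or a non-converting user journey, with the total path in the format above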

For the reprex above, the result would look like this:

# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
    <dbl> <chr>      <dttm>                   <dbl> <chr>                                         
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral         
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct                             
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct     

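...where:

  • User 2001 converted twice, and each path is shown separately;
  • User 2002 converted and then returned without converting, so the converting path and the non-converting path appear as separate rows;
  • User 2003 never converted, so the full path is shown.

1 answer:

Answer 0 (score: 3)

Here is one approach using dplyr: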

library(dplyr)

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%

  # For each user, split between known and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1; 
  #   new path if prior event was a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%

  # Regroup by path so that each path is labelled separately
  group_by(User_ID, known_session, path_num) %>%
  # Label each path by joining all of its sources in order
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep last step in each path
  filter(row_number() == n()) %>%
  ungroup() %>%

  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)
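
To see what the cumsum(lag(...)) line does, here is a small sketch on a bare vector. The conversion flags are hypothetical and stand for one user's events, already sorted.

```r
library(dplyr)

# Hypothetical conversion flags for one user's events, already in order
conversion <- c(0, 0, 1, 0, 1, 0)

# lag() shifts the flags down by one step, so a conversion starts a new path
# on the following event rather than on the converting event itself
path_num <- cumsum(lag(conversion, default = 0)) + 1

path_num
# [1] 1 1 1 2 2 3
```

The converting event (the third one here) stays in path 1, and the counter only moves to 2 on the next event. That is what keeps each conversion as the last step of its own path.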

I get slightly different results, but I think they correspond to the sample data provided:

> df2
# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path                                          
  <fct>   <fct>      <dttm>                   <dbl> <chr>                                         
1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral                           
3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook                           
4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct           
5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct
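
The differences from the question's expected output come down to sort order. As a rough sketch of an alternative, not a replacement for the approach above, the events can be ordered purely by Date_time within each user, ignoring Session_ID, and a new path started after every conversion. This assumes the timestamps are more trustworthy than the session IDs. The events tibble below is just the question's 15 rows rebuilt by hand, and the names events and by_time are made up for this example.

```r
library(dplyr)

# The same 15 events as the question's df, rebuilt as a plain tibble
events <- tibble(
  User_ID = c(2001, 2001, 2001, 2002, 2001, 2002, 2001, 2002, 2002, 2003,
              2003, 2001, 2002, 2002, 2001),
  Session_ID = c("1001", "1002", "1003", "1004", "1005", "1006", "1007",
                 "Not Set", "Not Set", "Not Set", "Not Set", "Not Set",
                 "1008", "1009", "Not Set"),
  Date_time = as.POSIXct(
    c(1540103940, 1540104060, 1540104240, 1540318080, 1540318680,
      1540318859, 1540314360, 1540413060, 1540413240, 1540538460,
      1540538640, 1540629660, 1540755060, 1540755240, 1540803000),
    origin = "1970-01-01", tz = "UTC"),
  Source = c("Facebook", "Facebook", "Facebook", "Google", "Email", "Google",
             "Email", "Referral", "Referral", "Facebook", "Facebook",
             "Google", "Referral", "Direct", "Direct"),
  Conversion = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)
)

by_time <- events %>%
  # Order each user's events strictly by timestamp
  arrange(User_ID, Date_time) %>%
  group_by(User_ID) %>%
  # Start a new path on the event after each conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
  # Label each (user, path) pair and keep only its last step
  group_by(User_ID, path_num) %>%
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  filter(row_number() == n()) %>%
  ungroup() %>%
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)

by_time
```

On this data the sketch reproduces the question's rows for users 2002 and 2003 (Google > Google > Referral > Referral, then Referral > Direct). User 2001 comes out as Facebook > Facebook > Facebook > Email followed by Email > Google > Direct, because session 1005 is timestamped later than the converting session 1007 in the sample data. Which ordering is right depends on whether the session IDs or the timestamps better reflect the real sequence of visits.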