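I have some clickstream data that I want to run an attribution analysis on in a particular way, but first I need to get it into a specific format for both converting and non-converting users.
Representative data: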
df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002,
2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set",
"Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"),
Date_time = structure(c(1540103940, 1540104060, 1540104240,
1540318080, 1540318680, 1540318859, 1540314360, 1540413060,
1540413240, 1540538460, 1540538640, 1540629660, 1540755060,
1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Source = c("Facebook", "Facebook", "Facebook", "Google",
"Email", "Google", "Email", "Referral", "Referral", "Facebook",
"Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
cols = list(User_ID = structure(list(), class = c("collector_double",
"collector")), Session_ID = structure(list(), class = c("collector_character",
"collector")), Date_time = structure(list(format = ""), class = c("collector_datetime",
"collector")), Source = structure(list(), class = c("collector_character",
"collector")), Conversion = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Then set the classes:
library(dplyr)

df <- df %>%
  mutate(User_ID = as.factor(User_ID),
         Session_ID = as.factor(Session_ID),
         Date_time = as.POSIXct(Date_time)
  )
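I want to get the full path of each user journey, whether or not it led to a purchase.
The new column path would have a format such as: Facebook > Facebook > Facebook > Email > Email
for user 2001, which I know how to do using
mutate(path = paste0(Source, collapse = " > "))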
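As a minimal sketch of that collapsing step on its own, the snippet below uses a made-up tibble (`toy` and its values are illustrative assumptions, not the real clickstream) to show that `paste0()` with `collapse` inside a grouped `summarise()` gives one path string per user:

```r
library(dplyr)

# Hypothetical toy data, not part of the real clickstream
toy <- tibble(
  User_ID = c(1, 1, 2, 1, 2),
  Source  = c("Facebook", "Email", "Google", "Email", "Direct")
)

# One row per user; sources are joined in row order within each user
toy_paths <- toy %>%
  group_by(User_ID) %>%
  summarise(Path = paste0(Source, collapse = " > "))

print(toy_paths)
```

Grouping by user alone ignores conversions, so a user who converts more than once would still get a single combined path.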
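The complications are:
Each row will be as follows: the path column reflects the journey to a conversion, so for a user's second or subsequent conversion it shows only the path since the previous conversion.
For the reprex above, the result would look like this: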
# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path
    <dbl> <chr>      <dttm>                   <dbl> <chr>
1    2001 1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2    2002 Not Set    2018-10-24 20:34:00          1 Google > Google > Referral > Referral
3    2003 Not Set    2018-10-26 07:24:00          0 Facebook > Facebook
4    2002 1009       2018-10-28 19:34:00          0 Referral > Direct
5    2001 Not Set    2018-10-29 08:50:00          1 Google > Direct
...where:
Answer 0 (score: 3)
Here is one approach using dplyr:
library(dplyr)

df2 <- df %>%
  # Add a column to distinguish between known and unknown sessions
  mutate(known_session = Session_ID != "Not Set") %>%
  # For each user, split between known and unknown sessions...
  group_by(User_ID, known_session) %>%
  # Sort first by Session ID, then time
  arrange(Session_ID, Date_time) %>%
  # Track which # path they're on. Start with path #1;
  # new path if prior event was a conversion
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1) %>%
  # Regroup by path number so each path is built and filtered separately
  group_by(User_ID, known_session, path_num) %>%
  # Label path journey by combining every step in that path
  mutate(Path = paste0(Source, collapse = " > ")) %>%
  # Just keep last step in each path
  filter(row_number() == n()) %>%
  ungroup() %>%
  # Tidying up with just the desired columns, chronological
  select(User_ID, Session_ID, Date_time, Conversion, Path) %>%
  arrange(Date_time)
I get slightly different results, but I believe they are consistent with the sample data provided:
> df2
# A tibble: 5 x 5
  User_ID Session_ID Date_time           Conversion Path
  <fct>   <fct>      <dttm>                   <dbl> <chr>
1 2001    1007       2018-10-23 17:06:00          1 Facebook > Facebook > Facebook > Email > Email
2 2002    Not Set    2018-10-24 20:34:00          1 Referral > Referral
3 2003    Not Set    2018-10-26 07:24:00          0 Facebook > Facebook
4 2002    1009       2018-10-28 19:34:00          0 Google > Google > Referral > Direct
5 2001    Not Set    2018-10-29 08:50:00          1 Google > Direct
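The path-numbering step can be checked in isolation. The sketch below uses a made-up single-user example with two conversions (the data and the names `toy`, `toy_numbered` and `toy_paths` are illustrative assumptions, not taken from the question). It shows that `cumsum(lag(Conversion, default = 0)) + 1` starts a new path number on the row after each conversion, and that grouping by that number gives one path per journey:

```r
library(dplyr)

# Hypothetical events for one user, already in the desired order
toy <- tibble(
  Source     = c("Facebook", "Email", "Google", "Direct", "Referral"),
  Conversion = c(0, 1, 0, 1, 0)
)

# lag() shifts conversions down one row, so the counter increases
# on the first event AFTER a conversion, not on the conversion itself
toy_numbered <- toy %>%
  mutate(path_num = cumsum(lag(Conversion, default = 0)) + 1)

# Collapse each numbered path and keep the outcome of its last step
toy_paths <- toy_numbered %>%
  group_by(path_num) %>%
  summarise(Path = paste0(Source, collapse = " > "),
            Conversion = last(Conversion))

print(toy_paths)
```

Here the first two paths end in a conversion and the trailing "Referral" event forms an unconverted third path, which is the shape asked for in the question.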