Replacing a source in clickstream data

Time: 2018-10-27 23:16:09

Tags: r clickstream

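I have clickstream data from an e-commerce website. Some customers can choose to buy products using a loan/financing option. Unfortunately, this creates a new referral source, labelled "finance" in the reproducible example below. It also creates one or more new sessions.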

I would like to replace the source "finance" with the source of the same user's preceding session.

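In this example, all observations in sessions 4-6871.2 and 4-6871.3 would get the source "(direct)", as in session 4-6871.1, while those in session 3-6871.1 would get "google", as in session 3-6871.0.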

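I need to do this on a much larger dataset, so I need logic that finds the sessions whose source is "finance" and replaces each instance of "finance" with the source of that user's preceding session.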

Reproducible data via dput:

df <- structure(list(userId = c("6.154032", "6.154032", "6.154032", 
"6.154032", "6.154032", "6.154032", "6.154032", "6.154032", "6.154032", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036", "8.154036", "8.154036", "8.154036", "8.154036", 
"8.154036", "8.154036"), session_Id = c("4-6871.0", "4-6871.0", 
"4-6871.0", "4-6871.1", "4-6871.1", "4-6871.1", "4-6871.2", "4-6871.2", 
"4-6871.3", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", "3-6871.0", 
"3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", "3-6871.1", 
"3-6871.1", "3-6871.1", "3-6871.1"), timeStamp = structure(c(1540294773, 
1540294828, 1540294841, 1540307321, 1540307341, 1540307718, 1540308709, 
1540308749, 1540311289, 1540330293, 1540330309, 1540330475, 1540330541, 
1540330663, 1540331041, 1540331164, 1540331168, 1540331312, 1540331459, 
1540331465, 1540331579, 1540331603, 1540331630), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), source = c("(direct)", "(direct)", 
"(direct)", "(direct)", "(direct)", "(direct)", "finance", "finance", 
"finance", "google", "google", "google", "google", "google", 
"finance", "finance", "finance", "finance", "finance", "finance", 
"finance", "finance", "finance")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -23L))

1 answer:

Answer 0: (score: 1)

Perhaps something in your full data structure makes this solution invalid, but here is a candidate:

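library(dplyr)  # arrange() and a lag() that shifts plain vectors (stats::lag() does not)

# Sort so that each user's rows are contiguous and in time order
df <- arrange(df, userId, timeStamp)
# Collapse consecutive identical sources into runs
tmp <- rle(df$source)
# Give every "finance" run the value of the run just before it
tmp$values[tmp$values == "finance"] <- lag(tmp$values)[tmp$values == "finance"]
# Expand the runs back to one value per row
df$source <- inverse.rle(tmp)
table(df$source)
# (direct)   google 
#        9       14 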
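
To see the run-length step in isolation, here is a minimal base R sketch on a made-up vector. The vector `src` and the helper names `runs`, `is_fin`, `prev` and `restored` are illustrative and not part of the question's data. The shift is done with `c(NA, head(x, -1))`, which should behave like `dplyr::lag()` for this purpose.

```r
# Toy source column (hypothetical values, already in time order)
src <- c("(direct)", "(direct)", "finance", "finance", "google", "finance")

# Run-length encode: one entry per block of consecutive identical values
runs <- rle(src)
# runs$values  is "(direct)" "finance" "google" "finance"
# runs$lengths is 2 2 1 1

# Value of the preceding run, NA for the first run
prev <- c(NA, head(runs$values, -1))

# Overwrite each "finance" run with the value of the run before it
is_fin <- runs$values == "finance"
runs$values[is_fin] <- prev[is_fin]

# Decode back to one value per row
restored <- inverse.rle(runs)
restored
# "(direct)" "(direct)" "(direct)" "(direct)" "google" "google"
```

Because consecutive "finance" rows always collapse into a single run, the run before a "finance" run is never itself "finance", so one replacement pass is enough.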

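The first line makes sure the rows are in the right order. Then, assuming no user has "finance" as their very first source, the remaining lines replace every "finance" entry with the entry that precedes it. Without that assumption, a user's leading "finance" rows would pick up a source from the previous user in the sort order (or NA for the very first rows).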
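
If that assumption might not hold on the larger dataset, one possible variant is to do the fill separately per user, so nothing leaks across users. The following is a sketch in base R, not the answer's method: `locf()` and `replace_finance()` are hypothetical helper names, the toy `clicks` data frame is made up, and it assumes the `source` column has no genuine missing values (they would be filled too). A user whose first source is "finance" ends up with NA rather than a borrowed value.

```r
# Carry the last non-NA value forward; leading NAs stay NA (hypothetical helper)
locf <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is the NA used for leading gaps
}

# Replace a placeholder source with the same user's previous real source
replace_finance <- function(df, placeholder = "finance") {
  df <- df[order(df$userId, df$timeStamp), ]
  src <- df$source
  src[src == placeholder] <- NA
  df$source <- ave(src, df$userId, FUN = locf)  # fill within each user only
  df
}

# Toy data: user "b" starts with "finance", which the rle approach would mis-fill
clicks <- data.frame(
  userId    = c("a", "a", "a", "b", "b", "b"),
  timeStamp = as.POSIXct(c(1, 2, 3, 1, 2, 3), origin = "1970-01-01", tz = "UTC"),
  source    = c("google", "finance", "finance", "finance", "(direct)", "finance"),
  stringsAsFactors = FALSE
)

fixed <- replace_finance(clicks)
fixed$source
# "google" "google" "google" NA "(direct)" "(direct)"
```

On this toy input the run-length approach would give user "b" the source "google" for its first row, because that row merges into user "a"'s trailing "finance" run. The grouped fill leaves it as NA so such cases can be inspected separately.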