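I have a dataset of purchase transactions. Below is a dummy dataset for illustration. I am trying to figure out how to reshape/dcast it to get the most frequent purchase sequences.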
require(data.table)
MainID=c('A1','A1','A2','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016')
df=data.table(MainID,Purchase,Date)
head(df)
   MainID Purchase      Date
1:     A1        A  1/1/2014
2:     A1        B 5/23/2015
3:     A2        C 6/12/2015
4:     C1        A  3/3/2013
5:     C1        A  5/5/2014
6:     C1        D 7/21/2014
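To start with, I am looking for sequences of two consecutive purchases (pairs). For a dataset like the one above, the unique sequence pairs are: A leads to B, B leads to C, A leads to D, E leads to B, and finally C leads to E. Note that I do not count A leading to A: I am studying sequences of different products, not repeat purchases of the same product. So in the output I want to ignore all such same-product sequences.
Desired output: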
Pair           Occurrence   No of customers   % confidence
A leads to B   1            3                 1/3
B leads to C   2            3                 2/3
A leads to D   1            3                 1/3
E leads to B   1            3                 1/3
C leads to E   2            3                 2/3
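I am aware of sequence-mining algorithms, but here I am only after some basic descriptive analysis.
Answer 0: (score: 1)
If I understand what you want, this might work. Note that I changed A2 to A1 in your data, and I added a date so that Date is a vector of length 11. I also created a tibble directly instead of using data.table.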
library(dplyr)

MainID=c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016', '8/8/2016')
df=tibble(MainID,Purchase,Date)

# Count distinct customers up front: summarise() below drops the MainID column
n_customers <- n_distinct(df$MainID)

df2 <- df %>%
  # Parse the m/d/Y strings so rows sort chronologically rather than alphabetically
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(MainID) %>%
  arrange(MainID, Date) %>%
  # Pair each purchase with the same customer's next purchase
  mutate(Next = lead(Purchase, 1),
         Pair = paste(Purchase, "leads to", Next)) %>%
  # Drop each customer's last purchase and same-product pairs (e.g. A leads to A)
  filter(!is.na(Next), Purchase != Next) %>%
  ungroup() %>%
  group_by(Pair) %>%
  summarise(Occurrence = n()) %>%
  mutate(N_consumers = n_customers,
         Percent_confidence = paste0(Occurrence, "/", N_consumers))
df2
# A tibble: 5 × 4
  Pair         Occurrence N_consumers Percent_confidence
  <chr>             <int>       <int> <chr>
1 A leads to B          1           3 1/3
2 A leads to D          1           3 1/3
3 B leads to C          2           3 2/3
4 C leads to E          2           3 2/3
5 E leads to B          1           3 1/3
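Since the question started from data.table, the same idea can probably be written there as well. The following is only a sketch under the same assumptions as above (A2 changed to A1, an eleventh date added); the names dt, pair_counts, N_customers and Confidence are made up for illustration, and Confidence is stored as a number rather than as a "1/3" string.

```r
library(data.table)

# Same modified data as in the answer above (assumption: A2 -> A1, 11 dates)
dt <- data.table(
  MainID   = c("A1","A1","A1","C1","C1","C1","D2","D2","D2","A1","D2"),
  Purchase = c("A","B","C","A","A","D","E","B","C","E","E"),
  Date     = c("1/1/2014","5/23/2015","6/12/2015","3/3/2013","5/5/2014",
               "7/21/2014","1/3/2016","4/5/2016","7/7/2016","6/27/2016",
               "8/8/2016")
)

# Parse the m/d/Y strings so that ordering is chronological
dt[, Date := as.Date(Date, format = "%m/%d/%Y")]
setorder(dt, MainID, Date)

# Next purchase of the same customer (NA for each customer's last purchase)
dt[, Next := shift(Purchase, type = "lead"), by = MainID]

# Keep only transitions between different products and count them per pair
pair_counts <- dt[!is.na(Next) & Purchase != Next,
                  .(Occurrence = .N),
                  by = .(Pair = paste(Purchase, "leads to", Next))]

# Hypothetical summary columns: number of customers and occurrence share
pair_counts[, N_customers := uniqueN(dt$MainID)]
pair_counts[, Confidence := Occurrence / N_customers]
setorder(pair_counts, Pair)
pair_counts
```

shift() with type = "lead" plays the role of dplyr's lead() here, and grouping with by = MainID keeps a pair from spanning two customers.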