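Counting frequent sequences

Time: 2017-05-02 02:31:42

Tags: r dplyr sequence reshape

I have a dataset of purchase transactions. Below is a dummy dataset for illustration. I am trying to figure out how to reshape/dcast it to get the most frequent purchase sequences.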

require(data.table)

MainID=c('A1','A1','A2','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016')

df=data.table(MainID,Purchase,Date)
head(df)

   MainID Purchase      Date
1:     A1        A  1/1/2014
2:     A1        B 5/23/2015
3:     A2        C 6/12/2015
4:     C1        A  3/3/2013
5:     C1        A  5/5/2014
6:     C1        D 7/21/2014

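As a start, I am looking for sequences made up of pairs of two consecutive purchases. For the dataset above, the set of unique sequence pairs is: (A leads to B, B leads to C, A leads to D, E leads to B, and finally C leads to E). Note that I do not count A leads to A: I am studying sequences of different products, not repeats of the same product. So I want the output to ignore all such same-product sequences.

Desired output: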

Pair                  Occurrence         No of customers        % confidence 
A leads to B             1                    3                    1/3
B leads to C             2                    3                    2/3
A leads to D             1                    3                    1/3
E leads to B             1                    3                    1/3
C leads to E             2                    3                    2/3 

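I am aware of sequence mining algorithms, but here I am only looking for some basic descriptive analysis.

1 answer:

Answer 0: (score: 1)

If I understand what you want, this might work. Note that I changed A2 to A1 in your data, and I added one date so that Date is a vector of length 11. I also created a tibble directly rather than using data.table.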

library(dplyr)

MainID=c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase=c('A','B','C','A','A','D','E','B','C','E','E')
Date=c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014','7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016', '8/8/2016')
df=tibble(MainID,Purchase,Date)

# Count customers up front: MainID is no longer a column after summarise()
n_customers <- n_distinct(df$MainID)

df2 <- df %>%
  # Parse the dates so that sorting is chronological, not alphabetical
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  group_by(MainID) %>%
  arrange(MainID, Date) %>%
  # Pair each purchase with the same customer's next purchase
  mutate(Next = lead(Purchase, 1),
         Pair = paste(Purchase, "leads to", Next)) %>%
  # Drop each customer's last purchase and same-product pairs (e.g. A to A)
  filter(!is.na(Next), Purchase != Next) %>%
  ungroup() %>%
  group_by(Pair) %>%
  summarise(Occurrence = n()) %>%
  mutate(N_consumers = n_customers,
         Percent_confidence = paste0(Occurrence, "/", N_consumers))

df2
# A tibble: 5 × 4
          Pair Occurrence N_consumers Percent_confidence
         <chr>      <int>       <int>              <chr>
1 A leads to B          1           3                1/3
2 A leads to D          1           3                1/3
3 B leads to C          2           3                2/3
4 C leads to E          2           3                2/3
5 E leads to B          1           3                1/3
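
One thing to watch with this approach: if the same customer produces the same pair more than once, the occurrence count can exceed the number of customers, so "occurrences / customers" stops being a share of customers. Below is a minimal base R sketch of a variant that also counts distinct customers per pair and reports a numeric confidence. The function name count_pairs and the columns Customers and Confidence are my own illustrative names, not part of the answer above, and the sketch assumes dates are in month/day/year format and that at least one pair exists.

```r
# Hypothetical helper (not from the original answer): count consecutive
# different-product pairs per customer, using base R only.
count_pairs <- function(id, purchase, date, format = "%m/%d/%Y") {
  d <- data.frame(id = id, purchase = purchase,
                  date = as.Date(date, format = format),
                  stringsAsFactors = FALSE)
  d <- d[order(d$id, d$date), ]   # chronological order within each customer
  n <- nrow(d)

  # Compare every row with the row after it; keep a pair only when both
  # rows belong to the same customer and the product changes.
  keep <- d$id[-n] == d$id[-1] & d$purchase[-n] != d$purchase[-1]
  pairs <- data.frame(
    id   = d$id[-n][keep],
    pair = paste(d$purchase[-n][keep], "leads to", d$purchase[-1][keep]),
    stringsAsFactors = FALSE
  )

  n_customers <- length(unique(d$id))
  occurrence  <- table(pairs$pair)                       # all occurrences
  customers   <- tapply(pairs$id, pairs$pair,
                        function(x) length(unique(x)))   # distinct customers
  customers   <- customers[names(occurrence)]

  data.frame(Pair       = names(occurrence),
             Occurrence = as.integer(occurrence),
             Customers  = as.integer(customers),
             Confidence = as.numeric(customers) / n_customers,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Same data as in the answer above
MainID   <- c('A1','A1','A1','C1','C1','C1','D2','D2','D2','A1','D2')
Purchase <- c('A','B','C','A','A','D','E','B','C','E','E')
Date     <- c('1/1/2014','5/23/2015','6/12/2015','3/3/2013','5/5/2014',
              '7/21/2014','1/3/2016','4/5/2016','7/7/2016','6/27/2016',
              '8/8/2016')

res <- count_pairs(MainID, Purchase, Date)
print(res)
```

With this data no customer repeats a pair, so Customers equals Occurrence for every row; the two columns only diverge once a customer repeats a sequence.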