Question

I have a data frame (DF) like this:

Now I need a way to do the following in R:

finding products that customers bought together

My output should look like this:

ProductX ProductY Times
1        2        2
2        3        1
3        4        1

A matrix output like this would also be nice:

      [,1] [,2] [,3] [,4] 
[1,]    1    2    0    0    
[2,]    2    1    1    0    
[3,]    0    0    1    1    
[4,]    0    0    1    1

I have already tried the reshape2 package, but I don't know how to get the output into this form.

Answer 1

Another possibility, using dplyr and tidyr:

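library(dplyr)
library(tidyr)

df %>%
 group_by(ID) %>%
 # name the second copy explicitly instead of relying on automatic name repair
 expand(Product, Product1 = Product) %>%
 filter(Product < Product1) %>%
 ungroup() %>%
 count(Product, Product1)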

  Product Product1     n
    <int>    <int> <int>
1       1        2     2
2       2        3     1
3       2        4     1
4       3        4     1

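However, I am not sure what the expected output is for an ID that bought the same pair of products more than once (if that is possible).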
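
On the repeated-purchase question: one way to make the count well defined is to drop duplicate (ID, Product) rows before pairing, so that each pair is counted at most once per ID. The sketch below uses base R only. `df_dup` is a made-up example containing a repeated purchase, and `count_pairs` is a hypothetical helper, neither comes from the question.

```r
# Hypothetical data: ID 1 bought product 2 twice
df_dup <- data.frame(ID = c(1, 1, 1, 2, 2),
                     Product = c(1, 2, 2, 1, 2))

# Hypothetical helper: count product pairs that share an ID
count_pairs <- function(d) {
  pairs <- merge(d, d, by = "ID")                      # all product pairs within each ID
  pairs <- pairs[pairs$Product.x < pairs$Product.y, ]  # keep each unordered pair once
  out <- aggregate(ID ~ Product.x + Product.y, data = pairs, FUN = length)
  names(out)[names(out) == "ID"] <- "n"                # aggregate() names the count after ID
  out
}

raw_counts   <- count_pairs(df_dup)          # the repeated row inflates the count
dedup_counts <- count_pairs(unique(df_dup))  # each ID contributes a pair at most once
```

Which of the two counts is the right one depends on what "bought together" should mean when a customer repeats a purchase.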

Answer 2

Here is an answer using data.table and a non-equi self-join. Edit: added allow.cartesian = TRUE so that it can handle large datasets.

library(data.table)

dt[dt 
    , on = .(ID = ID, Product < Product)
    , .(ProductX = x.Product, ProductY = i.Product)
    , nomatch = 0L
    , allow.cartesian = TRUE
   ][, .N, by = .(ProductX, ProductY)]

   ProductX ProductY N
1:        1        2 2
2:        2        3 1
3:        2        4 1
4:        3        4 1

The dplyr equivalent:

library(dplyr)

inner_join(tib, tib, by = 'ID')%>%
  filter(Product.x < Product.y)%>%
  count(Product.x, Product.y)

  Product.x Product.y     n
      <dbl>     <dbl> <int>
1         1         2     2
2         2         3     1
3         2         4     1
4         3         4     1

Here is a base R version as well:

aggregate(ID ~ Product.x + Product.y
          , data = merge(df, df, by = 'ID')
          , subset = Product.x < Product.y
          , FUN = length)

# Need to change the names from ID to n
  Product.x Product.y ID
1         1         2  2
2         2         3  1
3         2         4  1
4         3         4  1

Performance:

Unit: milliseconds
             expr     min       lq      mean  median       uq      max neval
           dt_way  3.9149  4.29330  4.593209  4.6597  4.80210   6.2326   100
 dplyr_inner_join  1.8218  1.91510  2.058864  2.0572  2.16205   3.0157   100
      dplyr_tidyr 13.8107 14.15735 16.020262 14.3571 14.78975 127.9654   100
         base_agg  2.3393  2.51215  2.586652  2.5804  2.63865   3.4415   100

n_IDs <- 1E3
n_Sims <- 1E5
ID_big <- sample(1:n_IDs, n_Sims, replace = TRUE)
Product_big <- sample(1:n_Sims, n_Sims, replace = TRUE)

Unit: seconds
             expr      min       lq      mean    median        uq       max neval
           dt_way 1.633111 1.904460  1.998192  1.986452  2.110937  2.308671    10 
 dplyr_inner_join 5.606322 6.361026  6.574015  6.606423  6.839273  7.198770    10
      dplyr_tidyr 8.385418 9.350730 10.127512 10.372830 10.675809 11.462403    10

Data:

ID <- c(1,1,2,2,2,3,3,4)
Product <- c(1,2,2,3,4,1,2,3)

dt <- data.table(ID, Product)
tib <- tibble(ID, Product)
df <- data.frame(ID, Product)
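
The question also asks for a matrix form, which none of the snippets above produce directly. A minimal sketch, assuming a symmetric product-by-product co-occurrence matrix is acceptable: build an ID-by-Product incidence matrix with table() and multiply it by its own transpose with crossprod(). The diagonal then holds the number of IDs that bought each product, so the result will not match the example matrix in the question cell for cell. The names `tab`, `incidence` and `co_matrix` are illustrative.

```r
ID <- c(1, 1, 2, 2, 2, 3, 3, 4)
Product <- c(1, 2, 2, 3, 4, 1, 2, 3)
df <- data.frame(ID, Product)

# Rows are IDs, columns are products; cells count purchases
tab <- table(df$ID, df$Product)

# Reduce to 0/1 so a repeated purchase by one ID is not counted twice
incidence <- matrix(as.numeric(tab > 0),
                    nrow = nrow(tab),
                    dimnames = list(rownames(tab), colnames(tab)))

# t(incidence) %*% incidence: cell [i, j] is the number of IDs that bought both i and j
co_matrix <- crossprod(incidence)
co_matrix
```

The off-diagonal cells carry the same counts as the pair tables above; use `co_matrix[upper.tri(co_matrix)]` if only one triangle is needed.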

Answer 3

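Here is an option using dplyr and tidyr. We group_by ID and build a list of the combinations of Product within each ID. An ID with only one Product has no pair to form, so it yields NA; finally we remove the NA entries and count the combinations.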
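
The code for this answer is not preserved here. The following is only a sketch of the approach as described, not the original author's code. The column name `pair`, the "-" separator, the distinct() step and the final separate() step are all assumptions.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 1, 2, 2, 2, 3, 3, 4),
                 Product = c(1, 2, 2, 3, 4, 1, 2, 3))

pair_counts <- df %>%
  distinct(ID, Product) %>%   # assumption: count each product once per ID
  group_by(ID) %>%
  # one list element per ID: all 2-combinations of its products, or NA if only one product
  summarise(pair = if (n() > 1) {
    list(combn(sort(Product), 2, paste, collapse = "-"))
  } else {
    list(NA_character_)
  }) %>%
  unnest(pair) %>%
  filter(!is.na(pair)) %>%    # drop IDs that had a single product
  count(pair) %>%
  separate(pair, into = c("ProductX", "ProductY"), sep = "-", convert = TRUE)

pair_counts
```

Sorting Product before calling combn() keeps each pair in a canonical order, so "1-2" and "2-1" are never counted separately.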


Counting products bought together
