Create a table with all possible interactions (two-way and three-way)

Time: 2017-12-20 22:35:50

Tags: r dplyr data.table

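I referred to the posts "Create table with all pairs of values from one column in R, counting unique values" and "Table of Interactions - Case with pets and houses" to learn how to create a two-way interaction table. How can I do this for all possible combinations, not just pairs? In addition, I want to find the frequency of occurrence and the revenue within each of these bins (combinations).

Here is my input data: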

   Customer      Product Revenue
1         A         Rice      10
2         A Sweet Potato       2
3         A       Walnut       4
4         B         Rice       3
5         B       Walnut       2
6         C       Walnut       3
7         C Sweet Potato       4
8         D         Rice       3
9         E Sweet Potato       4
10        F       Walnut       7
11        G         Rice       2
12        G Sweet Potato       3
13        H Sweet Potato       4
14        H       Walnut       6
15        I         Rice       2

DFI <- structure(list(Customer = c("A", "A", "A", "B", "B", "C", "C", 
"D", "E", "F", "G", "G", "H", "H", "I"), Product = c("Rice", 
"Sweet Potato", "Walnut", "Rice", "Walnut", "Walnut", "Sweet Potato", 
"Rice", "Sweet Potato", "Walnut", "Rice", "Sweet Potato", "Sweet Potato", 
"Walnut", "Rice"), Revenue = c(10, 2, 4, 3, 2, 3, 4, 3, 4, 7, 
2, 3, 4, 6, 2)), .Names = c("Customer", "Product", "Revenue"), row.names = c(NA, 
15L), class = "data.frame")

Here is the code that generates all combinations of the products Rice, Sweet Potato, and Walnut:

Combinations<-do.call(c,lapply(seq_along(unique(DFI$Product)), 
  combn, x = unique(DFI$Product), simplify = FALSE))

[[1]]
[1] "Rice"

[[2]]
[1] "Sweet Potato"

[[3]]
[1] "Walnut"

[[4]]
[1] "Rice"         "Sweet Potato"

[[5]]
[1] "Rice"   "Walnut"

[[6]]
[1] "Sweet Potato" "Walnut"      

[[7]]
[1] "Rice"         "Sweet Potato" "Walnut"      
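As a small illustration (not part of the original question), the list of combinations can be collapsed into one label per combination, which is the form the expected outputs use. The names `products` and `combo_labels` are made up for this sketch, and the product vector is typed in directly rather than taken from `DFI` so that the snippet runs on its own.

```r
# Hypothetical stand-in for unique(DFI$Product)
products <- c("Rice", "Sweet Potato", "Walnut")

# Same construction as in the question: all subsets of size 1, 2, and 3
Combinations <- do.call(c, lapply(seq_along(products),
  combn, x = products, simplify = FALSE))

# Collapse each character vector into a single comma-separated label
combo_labels <- vapply(Combinations, toString, character(1))

# With 3 products there are 2^3 - 1 non-empty combinations
length(combo_labels) == 2^length(products) - 1
```

These labels could then serve as a lookup of every possible combination, including any that no customer actually bought.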

Based on the combinations of product types, here is my expected output for frequency of occurrence:

  Combination Frequency
1           R         2
2           S         1
3           W         1
4         R,S         1
5         S,W         2
6         R,W         1
7       R,S,W         1

DFOUTa <- structure(list(Combination = c("R", "S", "W", "R,S", "S,W", "R,W", 
"R,S,W"), Frequency = c(2, 1, 1, 1, 2, 1, 1)), .Names = c("Combination", 
"Frequency"), row.names = c(NA, 7L), class = "data.frame")

Here is my expected output for revenue within the bins (i.e., the combinations of product types):

  Combination Revenue
1           R       5
2           S       4
3           W       7
4         R,S       5
5         S,W      17
6         R,W       5
7       R,S,W      16

DFOUTb <- structure(list(Combination = c("R", "S", "W", "R,S", "S,W", "R,W", 
"R,S,W"), Revenue = c(5, 4, 7, 5, 17, 5, 16)), .Names = c("Combination", 
"Revenue"), row.names = c(NA, 7L), class = "data.frame")

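I generated the data above manually and have double-checked it to make sure there are no errors.

I am not sure how to generate the two outputs I am looking for, and I would sincerely appreciate any help. I would prefer a data.table-based approach because of the size of my original dataset.

PS: For brevity, I shortened the product names Rice, Sweet Potato, and Walnut to R, S, and W, respectively, in the expected outputs.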

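2 Answers:

Answer 0: (score: 2)

This gives you both the frequency and the revenue. I am assuming that you want each customer's orders merged into a single combination: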

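library(data.table)
setDT(DFI)  # convert DFI to a data.table by reference

# 1) sort by Product, 2) build one combination and one revenue total per customer,
# 3) count customers and sum revenue per combination
DFI[order(Product)
  ][, .(Combination = paste(Product, collapse = ", "), Revenue = sum(Revenue)), by = .(Customer)
  ][, .(.N, Revenue = sum(Revenue)), by = .(Combination)]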

                  Combination N Revenue
1: Rice, Sweet Potato, Walnut 1      16
2:               Rice, Walnut 1       5
3:                       Rice 2       5
4:         Rice, Sweet Potato 1       5
5:       Sweet Potato, Walnut 2      17
6:               Sweet Potato 1       4
7:                     Walnut 1       7

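You may find it helpful to run the chained statements one at a time to see what happens at each step. The only specific thing I would point out is that we start with DFI[order(Product)] so that the combinations we generate are consistent, and we do not end up with both "Rice, Potato" and "Potato, Rice".

Answer 1: (score: 1)

I would ...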
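The chain can be unrolled into named steps, as sketched below. The intermediate names (`sorted`, `per_customer`, `per_bundle`, `unsorted`) are invented for this illustration, and the input is rebuilt inline so the snippet is self-contained. The last step skips the sort to show what the ordering guards against in this data set.

```r
library(data.table)

# Input data from the question, rebuilt inline
DFI <- data.table(
  Customer = c("A", "A", "A", "B", "B", "C", "C", "D", "E", "F", "G", "G", "H", "H", "I"),
  Product  = c("Rice", "Sweet Potato", "Walnut", "Rice", "Walnut", "Walnut",
               "Sweet Potato", "Rice", "Sweet Potato", "Walnut", "Rice",
               "Sweet Potato", "Sweet Potato", "Walnut", "Rice"),
  Revenue  = c(10, 2, 4, 3, 2, 3, 4, 3, 4, 7, 2, 3, 4, 6, 2)
)

# Step 1: sort by product so every customer's products appear in the same order
sorted <- DFI[order(Product)]

# Step 2: one row per customer, with a combination label and a revenue total
per_customer <- sorted[, .(Combination = paste(Product, collapse = ", "),
                           Revenue = sum(Revenue)), by = .(Customer)]

# Step 3: count customers and sum revenue per combination
per_bundle <- per_customer[, .(N = .N, Revenue = sum(Revenue)), by = .(Combination)]

# Without the sort, customers C and H get different labels for the same basket
unsorted <- DFI[, .(Combination = paste(Product, collapse = ", ")), by = .(Customer)]
unsorted[Customer %in% c("C", "H")]
```

In the unsorted version, customer C is labelled "Walnut, Sweet Potato" while customer H is labelled "Sweet Potato, Walnut", so the two would be counted as separate combinations.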

# spin off product table, assign abbreviations
prodDF = DFI[, .(Product = unique(Product))][, prod := substr(Product, 1, 1)]
DFI[prodDF, on=.(Product), prod := i.prod]

# spin off customer table, assign their bundles and revenues
custDF = DFI[order(prod), .(Bundle = toString(prod)), keyby=Customer]    
custDF[DFI[, sum(Revenue), by=.(Customer)], rev := i.V1]

# aggregate from customers to bundles
res = custDF[, .(.N, rev = sum(rev)), keyby=Bundle]

# clean up extra columns
DFI[, prod := NULL]

which gives

    Bundle N rev
1:       R 2   5
2:    R, S 1   5
3: R, S, W 1  16
4:    R, W 1   5
5:       S 1   4
6:    S, W 2  17
7:       W 1   7

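This is very similar to @Mako's answer, but ...

  1. Both of my aggregations use ?GForce when summing revenue, while Mako's customer-level revenue aggregation does not.
  2. This leaves you with a customer table that you can inspect or merge with other customer attributes (if you have any); the same goes for the product table.
  3. Neither of these really makes this answer better, just different. Despite the GForce point, my way might actually turn out to be slower, because I group on or merge with Customer three times, compared with a single time in the other answer. As for the second point, the other answer's one-liner might be preferable for simplicity or personal taste.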
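One way to look into the GForce point, offered as a rough sketch rather than a benchmark: `[.data.table` accepts a `verbose` argument that reports which internal optimizations were applied to `j`. The exact messages, and whether GForce applies to a given call, depend on the installed data.table version, so no output is shown here. The names `by_sum` and `by_mixed` are made up for this illustration.

```r
library(data.table)

# Input data from the question, rebuilt inline
DFI <- data.table(
  Customer = c("A", "A", "A", "B", "B", "C", "C", "D", "E", "F", "G", "G", "H", "H", "I"),
  Product  = c("Rice", "Sweet Potato", "Walnut", "Rice", "Walnut", "Walnut",
               "Sweet Potato", "Rice", "Sweet Potato", "Walnut", "Rice",
               "Sweet Potato", "Sweet Potato", "Walnut", "Rice"),
  Revenue  = c(10, 2, 4, 3, 2, 3, 4, 3, 4, 7, 2, 3, 4, 6, 2)
)

# j is a plain sum(): the kind of grouped call GForce is designed for
by_sum <- DFI[, .(Revenue = sum(Revenue)), by = Customer, verbose = TRUE]

# j mixes string collapsing with sum(), as in the one-liner approach
by_mixed <- DFI[order(Product),
                .(Bundle = toString(Product), Revenue = sum(Revenue)),
                by = Customer, verbose = TRUE]

# Whatever the optimizer does, the per-customer totals should agree
all.equal(by_sum[order(Customer), Revenue], by_mixed[order(Customer), Revenue])
```

On a data set this small any speed difference is negligible; timing both approaches on the real data would be the way to settle which is faster.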