Question

If I have a table of generic order transactions in R:

order_id   product_id   value
1000       A            100
1000       C            55
1000       D            75
1001       B            85
1001       A            35
1001       D            75
1002       B            70
1002       E            20

structure(list(order_id = c(1000L, 1000L, 1000L, 1001L, 1001L, 
1001L, 1002L, 1002L), product_id = structure(c(1L, 3L, 4L, 2L,1L, 4L, 2L, 5L), 
.Label = c("A", "B", "C", "D", "E"), class = "factor"), 
value = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)), .Names = c("order_id","product_id", "value"), 
class = "data.frame", row.names = c(NA, -8L))

How would I get the count and/or average/summed value of product pairings over order_id, like:

product_id_one    product_id_two     count
A                 B                  1
A                 C                  1
A                 D                  2
A                 E                  0
B                 C                  0
B                 D                  1
B                 E                  1
C                 D                  1
C                 E                  0
D                 E                  0

or

product_id_one    product_id_two     value_average
A                 B                  120
A                 C                  155
A                 D                  142.5
A                 E                  0
B                 C                  0
B                 D                  160
B                 E                  90
C                 D                  130
C                 E                  0
D                 E                  0

without just looping over it or using some similar iterative approach? The order of the product IDs within a pair should not matter.

Answer 1

My solution (updated):

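library(data.table)

mydf <- structure(list(order_id = c(1000L, 1000L, 1000L, 1001L, 1001L, 
    1001L, 1002L, 1002L), product_id = structure(c(1L, 3L, 4L, 2L, 
    1L, 4L, 2L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"), 
    value = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)), .Names = c("order_id", 
    "product_id", "value"), class = "data.frame", row.names = c(NA, -8L))

mydf <- data.table(mydf, key = "order_id")
# Self-join on order_id: pairs every product with every product of the same order
mydf2 <- mydf[mydf, allow.cartesian = TRUE]
# Drop pairs of a product with itself
mydf2 <- mydf2[product_id != i.product_id]
# Build an order-independent pair label, e.g. "A_C" for both (A, C) and (C, A)
mydf2[, idx := .I]
mydf2[, firstsecond := paste0(min(as.character(product_id), as.character(i.product_id)), "_",
                              max(as.character(product_id), as.character(i.product_id))), by = idx]
# Keep one row per pair, order and product; product_id is part of the grouping
# so that two products with the same value in one order are not collapsed
mydf2 <- mydf2[, .N, by = .(firstsecond, order_id, product_id, value)][, N := NULL]
# Per pair: number of orders containing it, and the summed pair value averaged over those orders
mydf3 <- mydf2[, .(count = length(unique(order_id)),
                   value_average = sum(value) / length(unique(order_id))), by = firstsecond]
mydf3[, c("product1", "product2") := tstrsplit(firstsecond, "_")]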
# firstsecond count value_average product1 product2
# 1:         A_C     1         155.0        A        C
# 2:         A_D     2         142.5        A        D
# 3:         C_D     1         130.0        C        D
# 4:         A_B     1         120.0        A        B
# 5:         B_D     1         160.0        B        D
# 6:         B_E     1          90.0        B        E

Let me know if this solves your problem.
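
The join above only returns pairs that occur together at least once. If the zero-count pairs from the question are also needed, one possible base R sketch is to take cross-products of the order-by-product matrices. The names orders, incidence, value_mat and pairs are illustrative and not part of the answer above, and giving never-paired products an average of 0 is an assumption taken from the layout in the question.

```r
# Sample data from the question (product_id kept as character for simplicity)
orders <- data.frame(
  order_id   = c(1000L, 1000L, 1000L, 1001L, 1001L, 1001L, 1002L, 1002L),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E"),
  value      = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)
)

# Orders x products matrices: summed value per cell, and a 0/1 presence flag
value_mat <- unclass(xtabs(value ~ order_id + product_id, data = orders))
incidence <- (unclass(table(orders$order_id, orders$product_id)) > 0) * 1L

# crossprod(m) is t(m) %*% m: cell [i, j] counts the orders containing both i and j
co_counts <- crossprod(incidence)
# Value of i in orders that contain j, plus value of j in orders that contain i
pair_sums <- crossprod(value_mat, incidence) + crossprod(incidence, value_mat)

# Keep each unordered pair once (upper triangle, diagonal excluded)
idx <- which(upper.tri(co_counts), arr.ind = TRUE)
pairs <- data.frame(
  product_id_one = rownames(co_counts)[idx[, 1]],
  product_id_two = colnames(co_counts)[idx[, 2]],
  count          = co_counts[idx],
  total          = pair_sums[idx]
)
# Assumption: pairs that never occur together get an average of 0
pairs$value_average <- ifelse(pairs$count > 0, pairs$total / pairs$count, 0)
pairs <- pairs[order(pairs$product_id_one, pairs$product_id_two), ]
print(pairs, row.names = FALSE)
```

This avoids an explicit loop, but it builds a dense products-by-products matrix, so it may not suit a very large catalogue.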

Answer 2

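Based on the sample data you provided, I don't see any link between product_id_one: A and product_id_two: B and the count or average values. Could you add more detail?

Otherwise, assuming you want to aggregate by order_id, I would suggest using data.table.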
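
As a rough illustration of what aggregating by order_id with data.table could look like (the summary columns n_products, total_value and mean_value are made-up names, chosen only for this sketch):

```r
library(data.table)

# Sample data from the question
orders <- data.table(
  order_id   = c(1000L, 1000L, 1000L, 1001L, 1001L, 1001L, 1002L, 1002L),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E"),
  value      = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)
)

# One row per order: distinct products, summed value and mean value
per_order <- orders[, .(
  n_products  = uniqueN(product_id),
  total_value = sum(value),
  mean_value  = mean(value)
), by = order_id]

print(per_order)
```

Note that this summarises each order as a whole; it does not by itself produce the product pairings asked about.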

If you are looking for associations, you may want to look at the association-rule mining algorithms in the arules package.
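
A minimal sketch of that route, assuming the arules package is installed: each order becomes one transaction, crossTable() gives pairwise co-occurrence counts, and apriori() mines rules. The support and confidence thresholds below are arbitrary values picked for this tiny sample, not recommendations.

```r
library(arules)

# Sample data from the question
orders <- data.frame(
  order_id   = c(1000L, 1000L, 1000L, 1001L, 1001L, 1001L, 1002L, 1002L),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E")
)

# One basket (character vector of products) per order
baskets <- split(as.character(orders$product_id), orders$order_id)
trans <- as(baskets, "transactions")

# Products x products matrix of co-occurrence counts
# (the diagonal holds how often each product was ordered at all)
pair_counts <- crossTable(trans, measure = "count")
print(pair_counts)

# Association rules with at least two items; thresholds are illustrative only
rules <- apriori(
  trans,
  parameter = list(supp = 0.3, conf = 0.5, minlen = 2),
  control   = list(verbose = FALSE)
)
inspect(rules)
```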

How to get count of product pairings per order?
