If I have a table of generic order transactions in R:
order_id  product_id  value
1000      A           100
1000      C           55
1000      D           75
1001      B           85
1001      A           35
1001      D           75
1002      B           70
1002      E           20
structure(list(order_id = c(1000L, 1000L, 1000L, 1001L, 1001L,
1001L, 1002L, 1002L), product_id = structure(c(1L, 3L, 4L, 2L,1L, 4L, 2L, 5L),
.Label = c("A", "B", "C", "D", "E"), class = "factor"),
value = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)), .Names = c("order_id","product_id", "value"),
class = "data.frame", row.names = c(NA, -8L))
How would I get the count and/or average/summed value of product pairings over order_id, like:
product_id_one  product_id_two  count
A               B               1
A               C               1
A               D               2
A               E               0
B               C               0
B               D               1
B               E               1
C               D               1
C               E               0
D               E               0
or
product_id_one product_id_two value_average
A B 175
A C 55
A D 142.5
A E 0
B C 0
B D 160
B E 90
C D 130
C E 0
D E 0
without resorting to a loop or some similar iterative approach? The order of the product IDs within a pair is not important.
Answer 0 (score: 1)
My solution (updated):
require(data.table)
mydf <- structure(list(order_id = c(1000L, 1000L, 1000L, 1001L, 1001L,
1001L, 1002L, 1002L), product_id = structure(c(1L, 3L, 4L, 2L,
1L, 4L, 2L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"),
value = c(100L, 55L, 75L, 85L, 35L, 75L, 70L, 20L)), .Names = c("order_id",
"product_id", "value"), class = "data.frame", row.names = c(NA,
-8L))
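# Key by order_id so the self-join below pairs up rows from the same order
mydf <- data.table(mydf, key = "order_id")
# Self-join: every product in an order is matched with every product in that order
mydf2 <- mydf[mydf, allow.cartesian = TRUE]
# Drop rows where a product is paired with itself
mydf2 <- mydf2[product_id != i.product_id]
# Build an order-independent pair label such as "A_D" (pmin/pmax are vectorised)
mydf2[, firstsecond := paste0(pmin(as.character(product_id), as.character(i.product_id)), "_",
                              pmax(as.character(product_id), as.character(i.product_id)))]
# Each pair occurs twice per order (A-D and D-A), so sum(value) adds each product's value once
mydf3 <- mydf2[, .(count = uniqueN(order_id), value_average = sum(value) / uniqueN(order_id)), by = firstsecond]
mydf3[, c("product1", "product2") := tstrsplit(firstsecond, "_")]
mydf3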
#    firstsecond count value_average product1 product2
# 1:         A_C     1         155.0        A        C
# 2:         A_D     2         142.5        A        D
# 3:         C_D     1         130.0        C        D
# 4:         A_B     1         120.0        A        B
# 5:         B_D     1         160.0        B        D
# 6:         B_E     1          90.0        B        E
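The output above lists only pairs that occur together at least once, whereas the question also asks for the zero-count pairs (A and E, for instance). One possible way to get those without a loop is a base R sketch built on an order-by-product matrix and crossprod(). This is not part of the original answer; the names orders, value_mat, incidence and pair_stats are made up for illustration, and it assumes each product appears at most once per order.

```r
# Sample data from the question
orders <- data.frame(
  order_id   = c(1000, 1000, 1000, 1001, 1001, 1001, 1002, 1002),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E"),
  value      = c(100, 55, 75, 85, 35, 75, 70, 20)
)

# Order-by-product matrix of values; NA where the product is not in the order
value_mat <- tapply(orders$value, list(orders$order_id, orders$product_id), sum)

# 0/1 incidence matrix: 1 if the product appears in the order
incidence <- (!is.na(value_mat)) * 1
value_mat[is.na(value_mat)] <- 0

# Number of orders that contain both products of a pair
co_count <- crossprod(incidence)

# Summed value of both products over the orders that contain both
co_total <- crossprod(value_mat, incidence) + crossprod(incidence, value_mat)

# Keep each unordered pair once (upper triangle), including pairs that never co-occur
idx <- which(upper.tri(co_count), arr.ind = TRUE)
pair_stats <- data.frame(
  product_id_one = rownames(co_count)[idx[, "row"]],
  product_id_two = colnames(co_count)[idx[, "col"]],
  count          = co_count[idx],
  value_sum      = co_total[idx]
)

# Average pair value per order; defined as 0 when the pair never occurs together
pair_stats$value_average <- ifelse(pair_stats$count > 0,
                                   pair_stats$value_sum / pair_stats$count, 0)
pair_stats <- pair_stats[order(pair_stats$product_id_one, pair_stats$product_id_two), ]
pair_stats
```

For the pairs that do occur, this should agree with the data.table result above (for example, 142.5 for A and D over two orders). The matrix has one cell per order and product, so it is only practical while the number of distinct products stays moderate.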
Let me know if this solves your problem.
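Answer 1 (score: 0)
From the sample data you provided, I can't see any association between product_id_one: A and product_id_two: B and the count or average. Could you add more detail?
Otherwise, assuming you want to aggregate by order_id, I would suggest using data.table.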
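As a rough illustration of what aggregating by order_id with data.table could look like, here is a minimal sketch. The summary columns n_products, value_sum and value_average are names chosen for this example, not something the answer specifies.

```r
library(data.table)

# Sample data from the question
orders <- data.table(
  order_id   = c(1000L, 1000L, 1000L, 1001L, 1001L, 1001L, 1002L, 1002L),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E"),
  value      = c(100, 55, 75, 85, 35, 75, 70, 20)
)

# One row per order: number of products, total value and mean value
per_order <- orders[, .(n_products    = .N,
                        value_sum     = sum(value),
                        value_average = mean(value)),
                    by = order_id]
per_order
```

Note that this summarises each order as a whole; it does not produce the per-pair figures the question asks for.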
If you are looking for associations, you may want to look at the association-rule mining algorithms in the arules package.
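A minimal sketch of how the question's data might be fed to arules follows. It assumes the arules package is installed; the support and confidence thresholds are arbitrary values picked for this tiny data set, not recommendations.

```r
library(arules)

# Sample data from the question
orders <- data.frame(
  order_id   = c(1000, 1000, 1000, 1001, 1001, 1001, 1002, 1002),
  product_id = c("A", "C", "D", "B", "A", "D", "B", "E")
)

# One transaction per order: split the products by order_id and coerce the list
trans <- as(split(orders$product_id, orders$order_id), "transactions")

# Mine rules with at least two items; thresholds are illustrative assumptions
rules <- apriori(trans,
                 parameter = list(support = 0.3, confidence = 0.5, minlen = 2))

# Show the rules with their support, confidence and lift
inspect(rules)
```

Association rules report support and confidence for item sets rather than summed or averaged values, so this complements the pair counts instead of replacing them.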