Question

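I have a data frame with one column holding an event ID and another column indicating which product was used at that event. Each product can be used only once per event, and each event contains at least one product. I want to know how many times each product was used together with each other product. Some sample data: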

set.seed(1)
events <- paste('Event ', sample(1:4, size = 15, replace = TRUE), sep = '')
events <- events[order(events)]

prods <- paste('Product ', c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))

test_data <- data.frame(events, prods)
test_data
  events      prods
1  Event 1 Product  1
2  Event 1 Product  2
3  Event 1 Product  3
4  Event 1 Product  4
5  Event 2 Product  1
6  Event 2 Product  5
7  Event 2 Product  6
8  Event 3 Product  2
9  Event 3 Product  4
10 Event 3 Product  6
11 Event 3 Product  7
12 Event 4 Product  1
13 Event 4 Product  2
14 Event 4 Product  3
15 Event 4 Product  5

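Product 1 and Product 2 occur together in the same event twice (Event 1 and Event 4), so I would like a '2' returned for that pair. Product 1 and Product 7 never appear in the same event, so I would like 0 returned for that pair. For 'matches' of a product with itself, I am happy to get back the total number of times the product was used.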

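Either of two output formats would be fine; I have no preference:

A short, wide data frame with the products running across the top as column headers and down the side as row headers. The body of the data frame would be filled with the match counts.
A long, narrow data frame with two columns covering every possible combination of product pairings, and a third column giving the number of times they matched.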

I have been experimenting with expand.grid but have nothing to show for it.

Thanks!

Answer 1

Split prods by events, compute all pairwise combinations within each event using combn, then aggregate to get a count for each combination.

out <- t(do.call(cbind,
  lapply(split(as.character(test_data$prods), test_data$events), combn, 2))
)
aggregate(count ~ . , data=transform(out,count=1), FUN=sum)

#           X1         X2 count
#1  Product  1 Product  2     2
#2  Product  1 Product  3     2
#3  Product  2 Product  3     2
#4  Product  1 Product  4     1
#5  Product  2 Product  4     2
#6  Product  3 Product  4     1
#7  Product  1 Product  5     2
#8  Product  2 Product  5     1
#9  Product  3 Product  5     1
#10 Product  1 Product  6     1
#11 Product  2 Product  6     1
#12 Product  4 Product  6     1
#13 Product  5 Product  6     1
#14 Product  2 Product  7     1
#15 Product  4 Product  7     1
#16 Product  6 Product  7     1
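
The long table above lists only pairs that co-occur at least once. For the wide layout the question also mentions, here is a minimal sketch, not part of the original answer, that assumes each product appears at most once per event. It takes the cross-product of the event-by-product incidence table. The sample data are hard-coded to match the printed table, because the output of sample() after set.seed(1) depends on the R version. The names inc, co and long are only illustrative.

```r
# Sample data, hard-coded to match the table printed in the question
events <- rep(paste0("Event ", 1:4), times = c(4, 3, 4, 4))
prods  <- paste("Product ", c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))
test_data <- data.frame(events, prods)

# Event-by-product incidence table: 1 if the product was used at the event
inc <- table(test_data$events, test_data$prods)

# t(inc) %*% inc: off-diagonal cells count shared events,
# diagonal cells count how often each product was used overall
co <- crossprod(inc)

co["Product  1", "Product  2"]  # pair seen in Event 1 and Event 4
co["Product  1", "Product  7"]  # pair never seen together

# Long format with every ordered pairing, zeros included
long <- as.data.frame(as.table(co))
names(long) <- c("prod1", "prod2", "count")
```

Unlike the combn approach, this should also cope with events that contain a single product, since combn(x, 2) stops with an error when x is a character vector with fewer than two elements.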

Answer 2

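Maybe this is using a sledgehammer to crack a nut, but you could mine (frequent) itemsets with the arules package, which offers other fancy stuff as well. It could work like this: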

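library(arules)
library(reshape2)
# One row per event, one logical column per product, coerced to transactions
trans <- as(sapply(dcast(test_data, events~prods, fun.aggregate = length, value.var="prods")[, -1], as.logical), "transactions")
# Mine itemsets of exactly two products, with no minimum support
sets <- apriori(trans, parameter = list(supp = 0, conf = 0, minlen = 2, maxlen = 2, target = "frequent itemsets"))
df <- as(sets, "data.frame")
# Convert support to a count and drop pairs that never co-occur
subset(transform(df, n=support*nrow(trans)), n>0, -support)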
#                      items n
# 2  {Product  6,Product  7} 1
# 4  {Product  4,Product  7} 1
# 6  {Product  2,Product  7} 1
# 7  {Product  5,Product  6} 1
# 8  {Product  3,Product  5} 1
# 10 {Product  1,Product  5} 2
# 11 {Product  2,Product  5} 1
# 13 {Product  4,Product  6} 1
# 14 {Product  1,Product  6} 1
# 15 {Product  2,Product  6} 1
# 16 {Product  3,Product  4} 1
# 17 {Product  1,Product  3} 2
# 18 {Product  2,Product  3} 2
# 19 {Product  1,Product  4} 1
# 20 {Product  2,Product  4} 2
# 21 {Product  1,Product  2} 2

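The support value is the fraction of events that contain both products. I multiply it by the number of transactions to get the frequency count.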
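
To make the support-to-count conversion concrete, here is a small base R sketch, not part of the original answer and not using arules, that works it out by hand for one pair. The sample data are hard-coded to match the question's table, and the names by_event, both and support are only illustrative.

```r
# Sample data, hard-coded to match the table printed in the question
events <- rep(paste0("Event ", 1:4), times = c(4, 3, 4, 4))
prods  <- paste("Product ", c(1, 2, 3, 4, 1, 5, 6, 2, 4, 6, 7, 1, 2, 3, 5))
test_data <- data.frame(events, prods)

# Products grouped per event (one "transaction" per event)
by_event <- split(as.character(test_data$prods), test_data$events)

# TRUE for each event that contains both products of the pair
pair <- c("Product  1", "Product  5")
both <- sapply(by_event, function(p) all(pair %in% p))

support <- mean(both)       # fraction of events containing the pair: 0.5
support * length(by_event)  # count of such events: 2
```

Here the pair occurs in Event 2 and Event 4, which agrees with the n = 2 shown for {Product  1,Product  5} above.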

Create a data frame of pair counts based on an event column
