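I am very new to R and have a question that is probably very simple for the experts here.
Suppose I have a table "sales" containing 4 customer IDs (123-126) and 4 products (A, B, C, D).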
ID A B C D
123 0 1 1 0
124 1 1 0 0
125 1 1 0 1
126 0 0 0 1
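I want to compute the overlap between products. For A, the number of IDs that have both A and B is 2. Similarly, the overlap between A and C is 0, and the overlap between A and D is 1. Here is my code for the overlap of A and B:
overlap <- sales[which(sales[, "A"] == 1 & sales[, "B"] == 1), ]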
countAB <- count(overlap,"ID")
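I want to repeat this calculation for all 4 products: A against B, C, D; B against A, C, D; and so on. How can I change the code to do this?
The final output should be the number of IDs for each two-product combination. This is a product affinity exercise: for each product, I want to find out which other products sell best alongside it. For example, for A the product sold with it most often is B, then D, then C. Some sorting needs to be added to the code to get this.
Thanks for your help!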
Answer 0 (score: 2)
Here is a possible solution:
sales <-
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))
# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')
# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
# N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition)
# since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})
> res
P1 P2 count
1 A B 2
2 A C 0
3 A D 1
4 B C 1
5 B D 1
6 C D 0
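The question also asks for a per-product ranking, which the table above does not give directly. A minimal sketch of one way to get it, assuming the same `sales` data: mirror each pair so that every product appears in the first column, then sort by product and descending count. The names `both` and `ranked` are introduced here for illustration only.

```r
sales <- read.csv(text = "ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")

# pairwise counts, computed as in the answer above
prods <- colnames(sales)[-1]
combs <- t(combn(prods, 2))
res <- data.frame(
  P1 = combs[, 1],
  P2 = combs[, 2],
  count = apply(combs, 1, function(comb) sum(rowSums(sales[, comb]) == 2)),
  stringsAsFactors = FALSE
)

# mirror every pair (A-B also listed as B-A) so each product shows up in P1
both <- rbind(
  res,
  data.frame(P1 = res$P2, P2 = res$P1, count = res$count,
             stringsAsFactors = FALSE)
)

# sort by product, then by descending co-purchase count
ranked <- both[order(both$P1, -both$count), ]
rownames(ranked) <- NULL
ranked
```

Reading the rows where `P1` is "A" from top to bottom gives the partners of A from most to least frequent.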
Answer 1 (score: 2)
#x1 is your dataframe
x1<-structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L,
1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID",
"A", "B", "C", "D"), class = "data.frame", row.names = c(NA,
-4L))
#get the combination of all colnames but the first ("ID")
k1<-combn(colnames(x1[,-1]),2)
#create two lists a1 and a2 so that we can iterate over each element
a1<-as.list(k1[seq(1,length(k1),2)])
a2<-as.list(k1[seq(2,length(k1),2)])
# your own functions with varying i and j
mapply(function(i,j) length(x1[which(x1[,i] == 1 & x1 [,j] == 1 ),1]),a1,a2)
[1] 2 0 1 1 1 0
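The result above is an unnamed vector, so it is not obvious which count belongs to which pair. As a sketch of an alternative under the same data, `combn()` accepts a `FUN` argument that is applied to each combination, which avoids building the two helper lists and makes it easy to label the counts. The names `prods` and `counts` are illustrative.

```r
x1 <- data.frame(
  ID = 123:126,
  A = c(0L, 1L, 1L, 0L),
  B = c(1L, 1L, 1L, 0L),
  C = c(1L, 0L, 0L, 0L),
  D = c(0L, 0L, 1L, 1L)
)

# every column except the ID column is a product
prods <- setdiff(colnames(x1), "ID")

# apply the counting function to each pair of product names
counts <- combn(prods, 2, FUN = function(p) {
  sum(x1[[p[1]]] == 1 & x1[[p[2]]] == 1)
})

# label each count with its pair, e.g. "A-B"
names(counts) <- combn(prods, 2, FUN = paste, collapse = "-")
counts
```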
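Answer 2 (score: 2)
You can use matrix multiplication:
library(reshape)  # melt() from reshape names the columns X1, X2 (reshape2 would use Var1, Var2)
m <- as.matrix(d[-1])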
z <- melt(crossprod(m,m))
z[as.integer(z$X1) < as.integer(z$X2),]
# X1 X2 value
# 5 A B 2
# 9 A C 0
# 10 B C 1
# 13 A D 1
# 14 B D 1
# 15 C D 0
where d is your data frame:
d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
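If you would rather not depend on a reshaping package, the same pair table can be sketched in base R. This is only an illustration under the same data: `crossprod(m)` is `t(m) %*% m`, so entry [i, j] counts the IDs that hold both product i and product j, and `upper.tri()` picks each unordered pair once. The names `co`, `idx` and `pairs` are made up for this example.

```r
d <- data.frame(
  ID = 123:126,
  A = c(0L, 1L, 1L, 0L),
  B = c(1L, 1L, 1L, 0L),
  C = c(1L, 0L, 0L, 0L),
  D = c(0L, 0L, 1L, 1L)
)

m <- as.matrix(d[-1])
co <- crossprod(m)  # co-occurrence matrix; the diagonal holds per-product totals

# row/column positions of the cells above the diagonal
idx <- which(upper.tri(co), arr.ind = TRUE)

pairs <- data.frame(
  P1 = rownames(co)[idx[, "row"]],
  P2 = colnames(co)[idx[, "col"]],
  count = co[idx],
  stringsAsFactors = FALSE
)
pairs
```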
[Update]
To compute the product affinity, you can do the following:
z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
# X1 X2 value
# A A B 2
# B B A 2
# C C B 1
# D D A 1
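The code above keeps only the single best partner for each product. If the full ordering is wanted (for A: B, then D, then C, as in the question), one possible sketch is to sort each row of the co-occurrence matrix after dropping the product itself. The names `co` and `ranking` are illustrative, and the order among tied counts is not meaningful.

```r
d <- data.frame(
  ID = 123:126,
  A = c(0L, 1L, 1L, 0L),
  B = c(1L, 1L, 1L, 0L),
  C = c(1L, 0L, 0L, 0L),
  D = c(0L, 0L, 1L, 1L)
)

co <- crossprod(as.matrix(d[-1]))  # co-occurrence counts between products

# for each product, drop its own column and sort the rest in descending order
ranking <- lapply(rownames(co), function(p) {
  sort(co[p, colnames(co) != p], decreasing = TRUE)
})
names(ranking) <- rownames(co)

ranking$A
```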
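Answer 3 (score: 1)
You may want to look at the arules package. It does exactly what you are asking for. From its description: it provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), and also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.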