Question

My input file contains one transaction per line. The following example shows the structure of the input file:

a
a
a,b
b
a,b
a,c
c
c

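The input file above has 11 items and 8 itemsets. It contains 3 unique items and 5 unique itemsets. I want to compute the frequency of each unique itemset. For the input file above, I would like to write an R script that produces output similar to the following CSV file: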

"a",0.25
"a,b",0.25
"c",0.25
"b",0.125
"a,c",0.125

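The report shows the number of occurrences of each unique itemset in the input transaction file, divided by the total number of itemsets in the input. Note that the report is sorted by itemset frequency, highest first. How can I compute the frequency of itemsets in an input transaction file using R?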

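Update: I have already computed association rules using read.transactions and apriori. I could reuse the results of these methods to compute the frequency of the input itemsets.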

Answer 1

dat <- read.table(text="a
a
a,b
b
a,b
a,c
c
c")
prop.table(table(dat$V1))

#    a   a,b   a,c     b     c 
#0.250 0.250 0.125 0.125 0.250 
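# Convert to a data frame and sort by frequency, highest first
dat.prop <- as.data.frame( prop.table(table(dat$V1)) )
dat.prop <- dat.prop[order(dat.prop$Freq, decreasing=TRUE), ]
dat.prop
#-------- Added the order step as a revision
#  Var1  Freq
#1    a 0.250
#2  a,b 0.250
#5    c 0.250
#3  a,c 0.125
#4    b 0.125
#---------

# write.table() has no 'header' argument; use col.names/row.names instead
write.table(dat.prop, file="dat.prop.csv", sep=",", row.names=FALSE, col.names=FALSE)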
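For reading from an actual file rather than inline text, the same idea can be sketched end to end as follows. This is only a sketch: the temp files stand in for the real input and output paths, and the variable names (infile, outfile, itemsets) are made up for illustration.

```r
# Hypothetical paths; temp files stand in for the real transaction file and report.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".csv")
writeLines(c("a", "a", "a,b", "b", "a,b", "a,c", "c", "c"), infile)

itemsets <- readLines(infile)            # one itemset per line, kept as a plain string
itemsets <- itemsets[nzchar(itemsets)]   # drop empty lines, if any

# Relative frequency of each distinct line, highest first
freq <- sort(prop.table(table(itemsets)), decreasing = TRUE)
out  <- data.frame(itemset = names(freq), freq = as.numeric(freq))

# Character fields are quoted by default; suppress the header and row names
write.table(out, file = outfile, sep = ",",
            row.names = FALSE, col.names = FALSE)
```

Using readLines avoids any field splitting, so a line such as a,b stays a single itemset regardless of the separator settings.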

Answer 2

This is straightforward:

Data <- read.table(header=TRUE, text="
itemset
a
a
a,b
b
a,b
a,c
c
c")

cbind(table(Data), table(Data) / nrow(Data))

## EDIT: Include sorting by observed proportion
tab <- table(Data)                            # observed freq. (named 'tab' so that T/TRUE is not masked)
tab <- cbind(tab, tab/nrow(Data))             # combine freq. and prop.
tab <- tab[order(tab[,2], decreasing=TRUE),]  # sort
colnames(tab) <- c("freq", "prop")            # add column names

Answer 3

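This code works if the input data is in a file named "dat.txt". The output is written to a file named "out.csv" in the same directory.

# Read one itemset per line from dat.txt
Y <- as.character(unlist(read.table('dat.txt')))
U <- unique(Y)                                    # distinct itemsets
n <- length(U)
Freq <- numeric(n)                                # named Freq so that F (FALSE) is not masked
for (i in seq_len(n)) Freq[i] <- mean(Y == U[i])  # share of lines equal to U[i]
# A data frame keeps Frequency numeric; cbind(U, Freq) would coerce it to character
D <- data.frame(Value = U, Frequency = Freq)
write.csv(D, 'out.csv', row.names = FALSE)

I apologize that this code is not very pretty.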
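The loop above can also be written without explicit iteration. The following is a minimal sketch using match and tabulate from base R; the inline vector Y is a stand-in for the contents of the input file, and the name res is only illustrative.

```r
# Stand-in for the itemsets read from the input file
Y <- c("a", "a", "a,b", "b", "a,b", "a,c", "c", "c")

U <- unique(Y)                                       # itemsets in order of first appearance
counts <- tabulate(match(Y, U), nbins = length(U))   # occurrences of each entry of U

res <- data.frame(Value = U, Frequency = counts / length(Y))
res <- res[order(res$Frequency, decreasing = TRUE), ]  # highest frequency first
res
```

Unlike table, which orders its result by the sorted itemset names, this keeps the itemsets in order of first appearance until the final sort.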

Answer 4

Another solution, using plyr:

library(plyr)
# Assumes 'dat' is a data frame with the itemsets in column V1, as built in Answer 1
ddply(dat, "V1", summarize, Freq = length(V1)/NROW(dat))

   V1  Freq
1   a 0.250
2 a,b 0.250
3 a,c 0.125
4   b 0.125
5   c 0.250
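Note that grouping on V1 compares the raw line text, so a line written as b,a would be counted separately from a,b. The sample input always lists items in the same order, so this does not matter there. If the item order in a real file is not guaranteed, one possible fix is to normalise each line before counting. The sketch below uses base R only, and its input vector is made up for illustration.

```r
# Made-up example in which the same itemset appears with different item order
lines <- c("a,b", "b,a", "a", "c,a", "a,c")

# Split each line into items, sort them, and join them back into one key
canon <- vapply(strsplit(lines, ",", fixed = TRUE),
                function(items) paste(sort(trimws(items)), collapse = ","),
                character(1))

freq <- prop.table(table(canon))   # relative frequency per normalised itemset
freq
```

The resulting canon vector can be used in place of V1 with any of the counting approaches in this thread.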
