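I have a list of about 100,000 items that were ordered together, which I have pasted into a single column so that I can count how many times each combination occurs.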
4845 Curly Fries California Burger 1
4846 French Fries California Burger 1
4847 Hamburger California Burger 1
4848 $1 Fountain Drinks Curly Fries 1
4849 $1 Fountain Drinks Curly Fries 1
4850 California Burger Curly Fries 1
4851 Curly Fries Curly Fries 1
I have explored the aggregate function, which gives the following error:
aggregate(t1$count,list(t1$pc), sum)
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
I have also tried variations of ddply:
ddply(t1,t1$pc,transform,occurances=sum(t1$count))
but I received this error:
Error in UseMethod("as.quoted") :
no applicable method for 'as.quoted' applied to an object of class "c('matrix', 'list')"
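I assume I get this because I am trying to "group by" character values. I have also explored tapply and recast, based on answers to similar questions, but to no avail.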
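How can I get these combination counts?
For reference, here is a sample of the items listed separately (again, apologies for the formatting):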
Var1 Var2 Var3
>2 Onion Rings Onion Rings 1
>3 Pineapple Cheddar Burger Onion Rings 1
>4 Onion Rings Pineapple Cheddar Burger 1
>5 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
>5 Onion Rings Onion Rings 1
>6 Pineapple Cheddar Burger Onion Rings 1
>7 Onion Rings Pineapple Cheddar Burger 1
>8 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
>9 Fountain Soda Fountain Soda 1
>10 French Fries Fountain Soda 1
Answer 0 (score: 4)
The table() function is useful here:
with(t1, table(pc)) ## or equivalently table(t1$pc)
This assumes that pc is the factor variable whose occurrences you want to count. (If it is not a factor, it will be coerced to one.)
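To make the table() suggestion concrete, here is a minimal sketch. The data frame below is a made-up stand-in for t1, and the flattening step is only an assumption: the error messages in the question hint that pc may be a list column rather than an atomic vector, in which case it would need to be turned into plain strings first.

```r
# Hypothetical stand-in for the asker's data; only the names t1, pc and count
# come from the question, the values are invented for illustration.
t1 <- data.frame(
  pc = c("Curly Fries, California Burger",
         "French Fries, California Burger",
         "Curly Fries, California Burger"),
  count = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Assumption: if pc were a list column, collapse each element to one string.
# For an ordinary character vector this leaves the values unchanged.
pc_chr <- vapply(t1$pc, function(x) paste(x, collapse = ", "),
                 character(1), USE.NAMES = FALSE)

counts <- table(pc_chr)  # one count per distinct value

# Turn the table into a data frame and put the most frequent values first.
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("pc", "n")
counts_df <- counts_df[order(-counts_df$n), ]
print(counts_df)
```

Converting the result to a data frame is optional, but it makes sorting and filtering a long list of combinations easier than working with the table object directly.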
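Answer 1 (score: 1)
Your initial approach is very close to what I think you want. Combining the items into a single factor will definitely work, as long as you combine them in the same order, so that you do not end up with both "Fries, Burger" and "Burger, Fries" as separate levels.
There may be a simpler way to do what you want, but it does not come to mind. Still, I think this does what you need: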
# Let's assume your data looks like this:
> df
Var1 Var2 Var3
1 Onion Rings Onion Rings 1
2 Pineapple Cheddar Burger Onion Rings 1
3 Onion Rings Pineapple Cheddar Burger 1
4 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
5 Onion Rings Onion Rings 1
6 Pineapple Cheddar Burger Onion Rings 1
7 Onion Rings Pineapple Cheddar Burger 1
8 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
9 Fountain Soda Fountain Soda 1
10 French Fries Fountain Soda 1
# Now, for each row
# 1. sort the Var1 and Var2,
# 2. combine the sorted vars, and
# 3. convert them back into a factor
df$sortcomb <- as.factor(apply(df[,1:2], 1, function(x) paste(sort(x), collapse=", ")))
table(df$sortcomb) # then use table as per normal
ddply(df, .(sortcomb), summarize, count=length(sortcomb)) # or ddply
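As a self-contained variant of the steps above that needs only base R, the sketch below rebuilds a few rows of the sample data by hand (the rows are a subset chosen for illustration) and sums the count column per sorted key with aggregate(). Summing Var3 rather than counting rows is an assumption that the third column holds order counts; it gives the same answer as table() whenever every count is 1.

```r
# A few rows copied from the sample data; column names follow the answer above.
df <- data.frame(
  Var1 = c("Onion Rings", "Pineapple Cheddar Burger", "Onion Rings", "French Fries"),
  Var2 = c("Onion Rings", "Onion Rings", "Pineapple Cheddar Burger", "Fountain Soda"),
  Var3 = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Order-independent key: sort the two item names within each row, then join them.
df$sortcomb <- apply(df[, c("Var1", "Var2")], 1,
                     function(x) paste(sort(x), collapse = ", "))

# Sum the count column per key. The key is a plain character vector, so
# aggregate() can group on it.
totals <- aggregate(Var3 ~ sortcomb, data = df, FUN = sum)
print(totals)
```

Because the names are sorted within each row before pasting, rows 2 and 3 end up with the same key and are counted together.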