Modifying the data input format for the Apriori algorithm

Time: 2016-06-20 06:22:51

Tags: r apriori arules

I have a dataset in the following format:

txn_id  prod_name
  223      milk 
  223      eggs  
  235      eggs
  235      bread
  235      butter

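I am trying to use this data to find associations between products (Market Basket Analysis). To use the Apriori algorithm in R, the data needs to be in the following format, with one row per transaction:
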
| prod_name | prod_name | prod_name |
  milk         eggs
  eggs         bread      butter

How can I achieve this?

2 Answers:

Answer 0: (score: 0)

The arules package has this capability built in. If you look at the documentation, you will find the following under transactions-class:

## example 4: creating transactions from a data.frame with transaction IDs and items
a_df3 <- data.frame(
    TID = c(1,1,2,2,2,3),
    item=c("a","b","a","b","c", "b")
)
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")

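split rearranges the data into a list with one element per TID, where each element holds all the items belonging to that TID. Coercing that list with as(..., "transactions") produces the object that apriori expects.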
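Applied to the data in the question, the same idea might look like the sketch below. The column names txn_id and prod_name come from the question; the variable names and the support and confidence thresholds are arbitrary placeholders chosen for this tiny example, not recommended values.

```r
library(arules)

# The question's data in long format: one row per (transaction, product) pair.
txns <- data.frame(
  txn_id    = c(223, 223, 235, 235, 235),
  prod_name = c("milk", "eggs", "eggs", "bread", "butter"),
  stringsAsFactors = FALSE
)

# split() groups the product names by transaction ID into a named list,
# which arules can coerce into a transactions object.
baskets <- split(txns$prod_name, txns$txn_id)
trans <- as(baskets, "transactions")

summary(trans)  # describes 2 transactions over 4 distinct items

# Placeholder thresholds; tune them for a real dataset.
rules <- apriori(
  trans,
  parameter = list(support = 0.5, confidence = 0.5),
  control   = list(verbose = FALSE)
)
inspect(rules)
```

Note that no wide, one-row-per-transaction data frame is needed along the way: the transactions object is built directly from the long format.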

Answer 1: (score: 0)

You can use dplyr and tidyr:

library(dplyr)
library(tidyr)

adf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
txn_id  prod_name
223      milk 
223      eggs  
235      eggs
235      bread
235      butter') %>% tbl_df

### For each transaction, a 'prod_name_key' is created for each 'prod_name'
adf %>%
  group_by(txn_id) %>%
  mutate(prod_name_key = paste0('prod_name_', 1:n())) %>%  # Creates key
  spread(prod_name_key, prod_name, fill = '')              # Reshapes data

## Source: local data frame [2 x 4]
## 
##   txn_id prod_name_1 prod_name_2 prod_name_3
##    (int)       (chr)       (chr)       (chr)
## 1    223        milk        eggs            
## 2    235        eggs       bread      butter

There may be a more concise way to do this, but this seems to produce what you asked for.
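In more recent versions of tidyr, spread has been superseded by pivot_wider. A rough equivalent of the reshaping step, assuming tidyr 1.1.0 or later and rebuilding the question's data inline, could be sketched as follows. The variable name wide is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, built directly as a tibble.
adf <- tibble(
  txn_id    = c(223L, 223L, 235L, 235L, 235L),
  prod_name = c("milk", "eggs", "eggs", "bread", "butter")
)

wide <- adf %>%
  group_by(txn_id) %>%
  # Number the products within each transaction: prod_name_1, prod_name_2, ...
  mutate(prod_name_key = paste0("prod_name_", row_number())) %>%
  ungroup() %>%
  # One row per transaction; missing positions are filled with "".
  pivot_wider(
    names_from  = prod_name_key,
    values_from = prod_name,
    values_fill = ""
  )

print(wide)
```

This yields the same one-row-per-transaction layout as the spread version, with empty strings where a transaction has fewer products.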