I have a dataset in the following format:
txn_id prod_name
223 milk
223 eggs
235 eggs
235 bread
235 butter
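I am trying to use this data to find associations between products (Market Basket Analysis). To use the Apriori algorithm in R, the data needs to be in the following format:
| prod_name | prod_name | prod_name |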
milk eggs
eggs bread butter
How can I achieve this?
Answer 0 (score: 0)
The arules package has this functionality. If you look at the documentation, under transactions-class you will find:
library(arules)
## example 4: creating transactions from a data.frame with transaction IDs and items
a_df3 <- data.frame(
TID = c(1,1,2,2,2,3),
item=c("a","b","a","b","c", "b")
)
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
split rearranges the data into a list with one element per TID, each element holding all the items that share that TID.
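To make the effect of split concrete, here is a minimal sketch that applies it to the data from the question, using base R only. The names txn_df and baskets are assumptions made for illustration. The final conversion to transactions is left as a comment because it requires the arules package.

```r
# Hypothetical data frame holding the data from the question
txn_df <- data.frame(
  txn_id    = c(223, 223, 235, 235, 235),
  prod_name = c("milk", "eggs", "eggs", "bread", "butter"),
  stringsAsFactors = FALSE
)

# split() groups the product names by transaction ID and returns a
# named list with one character vector per transaction
baskets <- split(txn_df$prod_name, txn_df$txn_id)
str(baskets)

# With arules installed, the list can then be coerced in the same way
# as in the documentation example:
# library(arules)
# trans <- as(baskets, "transactions")
```

Each list element corresponds to one row of the format requested in the question, so no padding with empty cells is needed before handing the data to arules.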
Answer 1 (score: 0)
You can use dplyr and tidyr.
library(dplyr)
library(tidyr)
adf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = '
txn_id prod_name
223 milk
223 eggs
235 eggs
235 bread
235 butter') %>% as_tibble() # as_tibble() replaces the deprecated tbl_df()
### For each transaction, a 'prod_name_key' is created for each 'prod_name'
adf %>%
group_by(txn_id) %>%
mutate(prod_name_key = paste0('prod_name_', 1:n())) %>% # Creates key
spread(prod_name_key, prod_name, fill = '') # Reshapes data
## Source: local data frame [2 x 4]
##
##   txn_id prod_name_1 prod_name_2 prod_name_3
##    (int)       (chr)       (chr)       (chr)
## 1    223        milk        eggs
## 2    235        eggs       bread      butter
There may be a more concise way to do this, but this seems to do what you asked for.
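As one possible variant, the same reshaping can be sketched with pivot_wider(), which tidyr 1.0.0 and later provide as the successor to spread(). This assumes a recent version of tidyr. The result name wide is hypothetical, and this is an alternative sketch rather than the approach the answer itself used.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt here so the block is self-contained
adf <- data.frame(
  txn_id    = c(223, 223, 235, 235, 235),
  prod_name = c("milk", "eggs", "eggs", "bread", "butter"),
  stringsAsFactors = FALSE
)

wide <- adf %>%
  group_by(txn_id) %>%
  # Number the products within each transaction to build the column key
  mutate(prod_name_key = paste0("prod_name_", row_number())) %>%
  ungroup() %>%
  # One row per txn_id, one column per key; missing cells become ""
  pivot_wider(
    names_from  = prod_name_key,
    values_from = prod_name,
    values_fill = ""
  )

wide
```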