How to convert vectorized transactions into a binary transaction matrix

Asked: 2018-10-30 08:10:09

Tags: r

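I have a data.frame named transactions with a single field named items, so that the i-th row consists of a vector holding the items of the i-th transaction. It looks like this:

> head(transactions)
                                              items
1                                       Cake, Fudge
2                                       Coffee, Tea
3                                Coffee, Choco, Tea
4                                            Coffee
5                                Bread, Muffin, Jam
6                                            Coffee

I want to convert it to a binary matrix in which each element indicates whether a given item was purchased in a given transaction. It should look like this:

  Cake Fudge Coffee Tea Choco Bread Muffin Jam
1    1     1      0   0     0     0      0   0
2    0     0      1   1     0     0      0   0
3    0     0      1   1     1     0      0   0
4    0     0      1   0     0     0      0   0
5    0     0      0   0     0     1      1   1
6    0     0      1   0     0     0      0   0

I can't find a way to do this without shady nested for loops. All of this is so that I can apply a package's apriori function, so any help would be much appreciated.

Thanks!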

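3 answers:

Answer 0 (score: 3)

We can create two new columns: row, which identifies each transaction, and spread_value, which holds the value (1) that marks an item as present. We use separate_rows to split each comma-separated value into its own row. Then we spread the values from long to wide, setting fill to 0 wherever an item is absent.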

library(tidyverse)

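df %>%
  # tag each transaction with its row number and a constant 1 to spread
  mutate(row = row_number(), spread_value = 1) %>%
  # one row per item
  separate_rows(items, sep = ",") %>%
  # drop the space left after each comma
  mutate(items = trimws(items)) %>%
  # long to wide: one column per item, 0 where the item is absent
  spread(items, spread_value, fill = 0) %>%
  select(-row)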


#  Bread Cake Choco Coffee Fudge Jam Muffin Tea
#1     0    1     0      0     1   0      0   0
#2     0    0     0      1     0   0      0   1
#3     0    0     1      1     0   0      0   1
#4     0    0     0      1     0   0      0   0
#5     1    0     0      0     0   1      1   0
#6     0    0     0      1     0   0      0   0
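
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Below is a minimal sketch of the same reshaping with pivot_wider(), assuming that version or newer is installed. The column name value and the result name wide are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the answers
df <- data.frame(
  items = c("Cake, Fudge", "Coffee, Tea", "Coffee, Choco, Tea",
            "Coffee", "Bread, Muffin, Jam", "Coffee"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # keep track of the transaction and add the value to spread
  mutate(row = row_number(), value = 1L) %>%
  # split on a comma plus any following whitespace, one row per item
  separate_rows(items, sep = ",\\s*") %>%
  # one column per item; absent items are filled with 0
  pivot_wider(names_from = items, values_from = value,
              values_fill = list(value = 0L)) %>%
  select(-row)

wide
```

Unlike spread(), pivot_wider() keeps the item columns in order of first appearance rather than sorting them alphabetically.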

Answer 1 (score: 2)

The cSplit_e function from splitstackshape does this.

df1 <- splitstackshape::cSplit_e(
  data = df,
  split.col = "items",
  sep = ", ",
  mode = "binary",
  fixed = TRUE,
  type = "character",
  fill = 0L,
  drop = TRUE
)

names(df1) <- sub("^items_", "", names(df1))
df1
#  Bread Cake Choco Coffee Fudge Jam Muffin Tea
#1     0    1     0      0     1   0      0   0
#2     0    0     0      1     0   0      0   1
#3     0    0     1      1     0   0      0   1
#4     0    0     0      1     0   0      0   0
#5     1    0     0      0     0   1      1   0
#6     0    0     0      1     0   0      0   0

Data

df <- structure(list(items = c("Cake, Fudge", "Coffee, Tea", "Coffee, Choco, Tea", 
"Coffee", "Bread, Muffin, Jam", "Coffee")), .Names = "items", class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Answer 2 (score: 0)

A non-dplyr option:

library(magrittr)
library(stringr)

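# all distinct items across the transactions
uniq_words <- df[["items"]] %>% 
  strsplit(", ") %>% 
  unlist() %>%
  unique()

# test every transaction string against every item; * 1L turns TRUE/FALSE into 1/0
sol <- outer(df[["items"]], uniq_words, str_detect) * 1L
colnames(sol) <- uniq_words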

sol
     Cake Fudge Coffee Tea Choco Bread Muffin Jam
[1,]    1     1      0   0     0     0      0   0
[2,]    0     0      1   1     0     0      0   0
[3,]    0     0      1   1     1     0      0   0
[4,]    0     0      1   0     0     0      0   0
[5,]    0     0      0   0     0     1      1   1
[6,]    0     0      1   0     0     0      0   0

Data

df <- data.frame(
  items = c(
    "Cake, Fudge", "Coffee, Tea", "Coffee, Choco, Tea", 
    "Coffee", "Bread, Muffin, Jam", "Coffee"
  ),
  stringsAsFactors = FALSE
)
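
One caveat with str_detect: it matches regular-expression patterns anywhere in the string, so an item name that is a substring of another (a hypothetical "Tea" next to "Teapot", say) would be flagged in both columns. Below is a minimal base R sketch that compares whole item names instead, assuming items are always separated by a comma and optional whitespace. The names baskets, all_items and m are illustrative.

```r
# Same sample data as above
df <- data.frame(
  items = c("Cake, Fudge", "Coffee, Tea", "Coffee, Choco, Tea",
            "Coffee", "Bread, Muffin, Jam", "Coffee"),
  stringsAsFactors = FALSE
)

# One character vector of items per transaction
baskets <- strsplit(df$items, ",\\s*")

# Distinct items, in order of first appearance
all_items <- unique(unlist(baskets))

# For each transaction, flag which of the distinct items it contains.
# vapply returns items in rows and transactions in columns, so transpose.
m <- t(vapply(
  baskets,
  function(b) as.integer(all_items %in% b),
  integer(length(all_items))
))
colnames(m) <- all_items

m
```

Because %in% tests exact equality, special regex characters in item names need no escaping here.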