Question

I am analyzing customer flows between different shops. I have data like this:

df <- data.frame(customer.id=letters[seq(1,7)], 
                 shop.1=c(1,1,1,1,1,0,0),
                 shop.2=c(0,0,1,1,1,1,0),
                 shop.3=c(1,0,0,0,0,0,1))
df

#>   customer.id shop.1 shop.2 shop.3
#> 1           a      1      0      1
#> 2           b      1      0      0  
#> 3           c      1      1      0 
#> 4           d      1      1      0 
#> 5           e      1      1      0 
#> 6           f      0      1      0 
#> 7           g      0      0      1

For example,

customer "a" shopped only at shops 1 and 3,
customer "b" shopped only at shop 1,
customer "c" shopped only at shops 1 and 2,
etc.

I would like to summarize the data like this:

#>           shop.1 shop.2 shop.3 
#> shop.1         5      3      1
#> shop.2         3      4      0       
#> shop.3         1      0      2

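For example, row 1 reads:

5 people shopped at both shop 1 and shop 1 (obviously a redundant observation)
3 people shopped at both shop 1 and shop 2
1 person shopped at both shop 1 and shop 3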

How can I accomplish this (please note: my data set has many shops, so a scalable approach is preferred)?

Answer 1

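crossprod can take care of what you want to do, after some basic manipulation to reshape the data into two columns representing customer and shop respectively: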

tmp <- cbind(df[1],stack(df[-1]))
tmp <- tmp[tmp$values==1,]

crossprod(table(tmp[c(1,3)]))

#        ind
#ind      shop.1 shop.2 shop.3
#  shop.1      5      3      1
#  shop.2      3      4      0
#  shop.3      1      0      2
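To see why crossprod gives the co-occurrence counts, here is a minimal sketch that checks one entry by hand. It rebuilds the question's df and uses the 0/1 columns directly as a customer-by-shop matrix; the names m, both_1_2 and co are illustrative only and not part of the answer above.

```r
df <- data.frame(customer.id = letters[1:7],
                 shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                 shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                 shop.3 = c(1, 0, 0, 0, 0, 0, 1))

# Customer-by-shop indicator matrix (rows: customers, columns: shops)
m <- as.matrix(df[-1])
rownames(m) <- df$customer.id

# Multiplying two indicator columns gives 1 only for customers who
# visited both shops, so the sum is the number of shared customers.
both_1_2 <- sum(m[, "shop.1"] * m[, "shop.2"])

# crossprod(m) is t(m) %*% m, i.e. that same sum for every pair of columns
co <- crossprod(m)
co["shop.1", "shop.2"] == both_1_2
```

The diagonal entries are just the column sums, which is why shop.1 paired with itself shows 5.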

Answer 2

You want to tabulate the co-occurrence of the shop.* variables:

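# Only needed if the shop columns hold strings ("x" = visited, "" = not visited);
# skip this step if they are already numeric 0/1, as in the df shown in the question.
df[,2:4] <- sapply(df[,2:4], function(x) { ifelse(x=="", 0, 1) } )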

1) This can probably be done with ftable(xtabs(...)), but I struggled with it for a long time and could not get it to work. The closest I got was:

> ftable(xtabs(~ shop.1 + shop.2 + shop.3, df))

              shop.3 0 1
shop.1 shop.2           
0      0             0 1
       1             1 0
1      0             1 1
       1             3 0

2) As @thelatemail showed, you can also:

# Transform your df from wide-form to long-form...
library(dplyr)
library(reshape2)
occurrence_df <- reshape2::melt(df, id.vars='customer.id') %>%
                 dplyr::filter(value==1)

   customer.id variable value
1            a   shop.1     1
2            b   shop.1     1
3            c   shop.1     1
4            d   shop.1     1
5            e   shop.1     1
6            c   shop.2     1
7            d   shop.2     1
8            e   shop.2     1
9            f   shop.2     1
10           a   shop.3     1
11           g   shop.3     1

In fact, we can drop the value column after the filter, so we can additionally pipe into %>% select(-value):

   customer.id variable
1            a   shop.1
2            b   shop.1
3            c   shop.1
4            d   shop.1
5            e   shop.1
6            c   shop.2
7            d   shop.2
8            e   shop.2
9            f   shop.2
10           a   shop.3
11           g   shop.3

# Then perform the same crossprod step as in @thelatemail's answer:

crossprod(table(occurrence_df))

        variable
variable shop.1 shop.2 shop.3
  shop.1      5      3      1
  shop.2      3      4      0
  shop.3      1      0      2

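(Footnotes:

First, your data should be numeric (or factor), not strings. You want to convert "x" to 1 and "" to 0.
If they are strings because they came from read.csv, use the read.csv argument stringsAsFactors=TRUE to make them factors, or colClasses to make them numeric, and see the many duplicate questions on this.)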
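As an illustration of the conversion described in the footnote, here is a minimal sketch. The raw data frame is hypothetical: it assumes the shop columns arrive as strings with "x" marking a visit and "" marking no visit, which is not the form of the df shown in the question.

```r
# Hypothetical raw input: "x" = visited, "" = not visited (an assumption)
raw <- data.frame(customer.id = letters[1:7],
                  shop.1 = c("x", "x", "x", "x", "x", "", ""),
                  shop.2 = c("", "", "x", "x", "x", "x", ""),
                  shop.3 = c("x", "", "", "", "", "", "x"),
                  stringsAsFactors = FALSE)

# Select every shop.* column by name so the code scales to many shops
shop_cols <- grep("^shop\\.", names(raw))

# Turn each string column into an integer 0/1 indicator
raw[shop_cols] <- lapply(raw[shop_cols], function(col) as.integer(col == "x"))
```

Selecting the columns by pattern rather than by position (2:4) avoids having to edit the code when shops are added.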

Answer 3

Actually, a matrix operation seems to be sufficient, since the data frame contains only 0s and 1s.

First, exclude the customer.id column and convert the data.frame to a matrix. This is straightforward. (mydf is the name of the data frame.)

# base R way
as.matrix(mydf[,-1])
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

library(dplyr) #dplyr way
(mymat <-
  mydf %>% 
  select(-customer.id) %>% 
  as.matrix())
#>      shop.1 shop.2 shop.3
#> [1,]      1      0      1
#> [2,]      1      0      0
#> [3,]      1      1      0
#> [4,]      1      1      0
#> [5,]      1      1      0
#> [6,]      0      1      0
#> [7,]      0      0      1

With this matrix, just perform the following matrix operation.

t(mymat) %*% mymat
#>        shop.1 shop.2 shop.3
#> shop.1      5      3      1
#> shop.2      3      4      0
#> shop.3      1      0      2

This gives you the answer.
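Since the question mentions many shops, here is a hedged sketch of two possible variations on the matrix product above. It assumes the Matrix package is available (it is a recommended package that ships with standard R installations); the names co_dense, sp, co_sparse and co_offdiag are illustrative only.

```r
library(Matrix)  # assumption: installed; provides sparse matrix classes

mydf <- data.frame(customer.id = letters[1:7],
                   shop.1 = c(1, 1, 1, 1, 1, 0, 0),
                   shop.2 = c(0, 0, 1, 1, 1, 1, 0),
                   shop.3 = c(1, 0, 0, 0, 0, 0, 1))
mymat <- as.matrix(mydf[, -1])

# crossprod(x) computes t(x) %*% x without forming the transpose explicitly
co_dense <- crossprod(mymat)

# With many shops and mostly-zero data, a sparse matrix may save memory
sp <- Matrix(mymat, sparse = TRUE)
co_sparse <- crossprod(sp)

# Optionally zero the diagonal, since "shop i with shop i" is redundant
co_offdiag <- co_dense
diag(co_offdiag) <- 0
```

Whether the sparse version is actually faster depends on how sparse the real data is, so it is worth timing both on a sample.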

How can I summarize this data using R?
