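Summarizing counts of string groups using tidyr / dplyr

Time: 2015-07-16 11:43:24

Tags: r dplyr tidyr

I need to summarize counts of strings that I have assigned to groups. I know this can be done in dplyr / tidyr, but I am missing something.

Example dataset: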

Owner = c('bob','julia','cheryl','bob','julia','cheryl')
Day = c('Mon', 'Tue') 
Locn = c('house','store','apartment','office','house','shop')
data <- data.frame(Owner, Day, Locn)

Which looks like this:

   Owner Day      Locn
1    bob Mon     house
2  julia Tue     store
3 cheryl Mon apartment
4    bob Tue    office
5  julia Mon     house
6 cheryl Tue      shop

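I would like to group by owner and day, then count the grouped locations into columns. In this example, 'house' and 'apartment' should be counted in a column headed 'Home', and 'store', 'office' and 'shop' should be counted in a column headed 'Work'.

My current code (which does not work):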

grouped_locn <- data %>%
  dplyr::arrange(Owner, Day) %>%
  dplyr::group_by(Owner, Day) %>%
  dplyr::summarize(Home = which(data$Locn %in% c('house', 'apartment')), 
               Work = which(data$Locn %in% c("store", "office", "apartment")))

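I have only included my current attempt at the summarize step to show how I have been approaching it. The Home and Work expressions currently return a vector of the row numbers that contain an element of that group (i.e. Home = 1 3 5).

My expected output: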

   Owner Day   Home  Work
1    bob Mon      1     0
2    bob Tue      0     1
3  julia Mon      1     0
4  julia Tue      0     1
5 cheryl Mon      1     0
6 cheryl Tue      0     1

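In the real dataset (30k+ rows) there are multiple Locn values per Owner per Day, so the Home and Work counts can be numbers other than 1 and 0 (so booleans will not do).

Many thanks.

4 answers: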

Answer 0: (score: 10)

Using data.table, here is a simple and efficient solution.

For older versions (v < 1.9.5):

library(data.table) # v < 1.9.5
setDT(data)[, Locn2 := c("Work", "Home")[(Locn %in% c('house', 'apartment')) + 1L]]
dcast.data.table(data, Owner + Day ~ Locn2, length)
#     Owner Day Home Work
# 1:    bob Mon    1    0
# 2:    bob Tue    0    1
# 3: cheryl Mon    1    0
# 4: cheryl Tue    0    1
# 5:  julia Mon    1    0
# 6:  julia Tue    0    1

For newer versions (v >= 1.9.5), you can do this in one line:

dcast(setDT(data), Owner + Day ~ c("Work", "Home")[(Locn %in% c('house', 'apartment')) + 1L], length)
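
The indexing trick used in both versions may be easier to follow in isolation. The following is a minimal base R sketch on a made-up vector named `locn` (not part of the question's data): `%in%` yields a logical vector, adding `1L` turns FALSE/TRUE into 1/2, and those integers pick elements out of `c("Work", "Home")`.

```r
# Hypothetical input vector, used only for illustration
locn <- c("house", "store", "apartment", "office")

is_home <- locn %in% c("house", "apartment")  # logical: TRUE where the location is a home
idx     <- is_home + 1L                       # FALSE becomes 1, TRUE becomes 2
label   <- c("Work", "Home")[idx]             # position 1 is "Work", position 2 is "Home"

print(label)
```

The order of the two labels matters: "Work" has to come first because FALSE maps to index 1.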

Here is a tidyr alternative:

library(dplyr)
library(tidyr)
data %>%
  mutate(temp = 1L, 
         Locn = ifelse(Locn %in% c('house', 'apartment'), "Home", "Work")) %>% 
  spread(Locn, temp, fill = 0L)

#    Owner Day Home Work
# 1    bob Mon    1    0
# 2    bob Tue    0    1
# 3 cheryl Mon    1    0
# 4 cheryl Tue    0    1
# 5  julia Mon    1    0
# 6  julia Tue    0    1
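
Note that the `spread` approach relies on each Owner/Day/category combination appearing at most once, which holds for the example but, according to the question, not for the real data. A possible variant that counts first and then reshapes is sketched below. It assumes tidyr 1.0.0 or later (for `pivot_wider`) and uses a made-up data frame `data2` in which bob has two home locations on Monday.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a repeated Owner/Day/category combination
data2 <- data.frame(
  Owner = c("bob", "bob", "bob", "julia"),
  Day   = c("Mon", "Mon", "Tue", "Mon"),
  Locn  = c("house", "apartment", "office", "store"),
  stringsAsFactors = FALSE
)

result <- data2 %>%
  mutate(Locn = ifelse(Locn %in% c("house", "apartment"), "Home", "Work")) %>%
  count(Owner, Day, Locn) %>%                  # one row per Owner/Day/category, count in n
  pivot_wider(names_from = Locn,               # categories become columns
              values_from = n,
              values_fill = list(n = 0L))      # combinations that never occur become 0

print(result)
```

Here bob's Monday row should show a Home count of 2 rather than an error or a 1.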

Answer 1: (score: 7)

Try this:

data %>%
  group_by(Owner, Day) %>%
  summarise(Home = sum(Locn %in% c("house", "apartment")), 
            Work = sum(Locn %in% c("store", "office", "shop")))
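
The difference from the question's attempt comes down to `sum()` versus `which()` on a logical vector. Inside `summarise`, the bare name `Locn` refers to the current group's values, whereas `data$Locn` always refers to the whole column. A small base R sketch of the two functions, using the question's location values as a plain vector:

```r
locn <- c("house", "store", "apartment", "office", "house", "shop")
is_home <- locn %in% c("house", "apartment")

positions <- which(is_home)  # indices of the TRUE elements
n_home    <- sum(is_home)    # TRUE counts as 1, FALSE as 0

print(positions)
print(n_home)
```

`positions` reproduces the `1 3 5` the question reports, while `n_home` is the single count that a summary column needs.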

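Answer 2: (score: 4)

You can use model.matrix from base R:

# The logical indx expands into two indicator columns (FALSE, then TRUE),
# which are stored as Work and Home respectively
data[c('Work', 'Home')] <- model.matrix(~0+indx, transform(data, 
       indx =  Locn %in% c('house', 'apartment')))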

   data
 #   Owner Day      Locn Work Home
 #1    bob Mon     house    0    1
 #2  julia Tue     store    1    0
 #3 cheryl Mon apartment    0    1
 #4    bob Tue    office    1    0
 #5  julia Mon     house    0    1
 #6 cheryl Tue      shop    1    0

 library(qdapTools)
 data[c('Work', 'Home')] <- mtabulate(data$Locn %in% c('house', 'apartment'))
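
Both snippets in this answer add per-row indicator columns rather than the per-Owner/Day table the question asks for. One possible way to finish the job in base R is `aggregate` with `sum`. The sketch below rebuilds the example data so that it runs on its own, and creates the `Home` and `Work` indicators directly with `%in%` instead of `model.matrix`, which is a simplification and not the answer's method.

```r
Owner <- c("bob", "julia", "cheryl", "bob", "julia", "cheryl")
Day   <- rep(c("Mon", "Tue"), 3)
Locn  <- c("house", "store", "apartment", "office", "house", "shop")
data  <- data.frame(Owner, Day, Locn, stringsAsFactors = FALSE)

# Per-row 0/1 indicators (same information as the model.matrix columns)
data$Home <- as.integer(data$Locn %in% c("house", "apartment"))
data$Work <- 1L - data$Home

# Sum the indicators within each Owner/Day combination
totals <- aggregate(cbind(Home, Work) ~ Owner + Day, data = data, FUN = sum)
print(totals)
```

With repeated Owner/Day rows, the sums would exceed 1, which is the behaviour the question needs for the real dataset.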

Answer 3: (score: 2)

This is like the solution proposed by @lukeA, but using the grepl function:

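library(dplyr)
library(magrittr)  # provides %<>%, which dplyr does not export

# %<>% pipes data through the chain and assigns the result back to data
data %<>% arrange(Owner, Day) %>% group_by(Owner, Day) %>%
  summarise(Home=sum((grepl("house|apartment", Locn))*1), 
            Work=sum((grepl("store|office|shop", Locn))*1))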
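
One caveat with `grepl`: an unanchored pattern matches substrings, so location names that merely contain one of the keywords would be counted too. The sketch below uses made-up values ("warehouse", "workshop") that are not in the question's data to show the difference between a loose pattern, an anchored pattern, and `%in%`.

```r
# Hypothetical location values chosen to expose substring matching
locn <- c("house", "warehouse", "shop", "workshop")

loose      <- grepl("house|shop", locn)      # matches anywhere in the string
exact      <- grepl("^(house|shop)$", locn)  # anchored: the whole string must match
same_as_in <- locn %in% c("house", "shop")   # exact matching without a regex

print(loose)
print(exact)
print(same_as_in)
```

If the location names are a fixed set of exact strings, as in the question, `%in%` or an anchored pattern is probably the safer choice.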