In R, count occurrences of factor/character values in all columns, grouped by a key

Time: 2015-06-13 23:08:09

Tags: r

This problem can probably be solved with data.table or dplyr. I have a dataset (data frame) that looks like this:

summary(mooc_events)
 signup_id         time                source               event        
 Min.   :     1   Min.   :2013-10-27   browser:3869940   access    :3112191  
 1st Qu.: 18721   1st Qu.:2013-12-19   server :4287337   discussion: 649259  
 Median : 48331   Median :2014-05-30                     navigate  :1009309  
 Mean   : 63476   Mean   :2014-04-05                     page_close:1237883  
 3rd Qu.:110375   3rd Qu.:2014-06-15                     problem   :1261170  
 Max.   :200905   Max.   :2014-08-01                     video     : 796958  
                                                         wiki      :  90507  
    artefact_sha         
 Length:8157277    
 Class :character  
 Mode  :character  

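One signup_id has multiple events, so many rows start with the same signup_id.

What I want to achieve is an aggregated dataset (data.table or data frame) with one column for each distinct value of each particular column, all grouped by signup_id, so for this data it would look like this: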

signup_id, source_browser, source_server, event_access, event_discussion, ... , event_wiki, artefact_sha_{first_element_in_whole_dataset}, ..., artefact_sha_{last_element_in_whole_dataset}

1, 23, 37, 9, 0, ..., 3, 7, ..., 1
2, 2, 7, 2, 2, ..., 1, 0, ..., 0

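In other words, it counts the occurrences of each value in the given set of columns, grouped by the single column signup_id. I am not interested in grouping by several columns at once, e.g. signup_id and source.

Column naming is not strict (_ can be replaced with anything meaningful).

(Let's skip the time column for now.)

Best regards, and thanks in advance.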

2 answers:

Answer 0: (score: 3)

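This is more of a reshaping problem, which can be solved with the tidyr and reshape2 libraries.

Reshape with tidyr and count the occurrences with reshape2:

My example does not include artefact_sha, because I don't understand what you want to do with it.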

library(dplyr) # Or library(magrittr) for the pipe syntax
library(tidyr)
library(reshape2)

set.seed(42)
mooc_events <- data.frame(signup_id = rep(1:3, each = 5), 
                    time = Sys.Date(), 
                    source = sample(c("browser", "server"), 15, replace = TRUE), 
                    event = sample(c("access", "discussion", "navigate"), 15, replace = TRUE), 
                    stringsAsFactors = FALSE)

mooc_events.m <- 
  mooc_events %>% 
  gather(key, value, -c(signup_id, time)) %>% 
  unite(var, key, value, sep = "_")

myTable <- dcast(mooc_events.m, signup_id ~ var, fun.aggregate = length, value.var = "var")

> myTable
  signup_id event_access event_discussion event_navigate source_browser source_server
1         1            1                2              2              1             4
2         2            2                0              3              1             4
3         3            0                3              2              3             2
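
Since the question mentions data.table, the same melt-then-cast idea can also be sketched with data.table alone. This is only a minimal sketch on a made-up toy table (the values are assumptions, not the asker's data). A character column such as artefact_sha could presumably be added to measure.vars in the same way, at the cost of one output column per distinct value.

```r
library(data.table)

# Toy data with the same column names as the question (values are invented)
mooc_events <- data.table(
  signup_id = c(1, 1, 1, 2, 2),
  source    = c("browser", "server", "server", "browser", "browser"),
  event     = c("access", "access", "wiki", "video", "video")
)

# Long format: one row per (signup_id, column name, column value)
long <- melt(mooc_events, id.vars = "signup_id",
             measure.vars = c("source", "event"))

# Wide format: one count column per "<column>_<value>" pair.
# With fun.aggregate = length, pairs that never occur for an id are filled with 0.
counts <- dcast(long, signup_id ~ variable + value,
                fun.aggregate = length, value.var = "value")

print(counts)
```

Because melt() stacks all measured columns into one value column, they should share a type (here both are character); factor columns with different levels are coerced, with a warning.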

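Answer 1: (score: 1)

Maybe this will work. It is a combination of dplyr and reshape2. This only generates some of the variables. To include the rest of the variables you want to count, just add them to the group_by call and to dcast, i.e. dcast(tst, signup_id ~ source+event+...)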

library(dplyr)
library(reshape2)

## First get counts for groupings of variables
tst <- mooc_events %>% group_by(signup_id, source, event) %>%
  dplyr::summarise(count=n())

## Then reshape data long -> wide
dcast(tst, signup_id ~ source+event, value.var = "count")

#    signup_id browser_access browser_navigate browser_video browser_wiki
# 1          1              2               NA             1            2
# 2          2             NA               NA             2           NA
# 3          3              3                1            NA            3
# ...
#    server_access server_navigate server_video server_wiki
# 1             NA               1            3           1
# 2              3               2           NA           1
# 3             NA               4           NA           5

## Some sample data (define this before running the code above)
mooc_events <- data.frame(
    signup_id=sample(1:10, 100, replace=TRUE), 
    source=factor(sample(c("browser", "server"), 100, replace=TRUE)),
    event=factor(sample(c("access","navigate","video","wiki"), 100, replace=TRUE))
)
head(mooc_events)

#   signup_id  source    event
# 1         5 browser     wiki
# 2         4  server navigate
# 3         1 browser navigate
# 4         7 browser   access
# 5         8  server   access
# 6         5 browser     wiki
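
The sample output in this answer shows NA wherever a source/event combination never occurs for a signup_id. If zeros are preferred, one option is the fill argument of reshape2's dcast. The following is a minimal sketch on a small invented table (the data values are assumptions), using dplyr's count() as a shorthand for the group_by/summarise step:

```r
library(dplyr)
library(reshape2)

# Small invented example with the same column names as above
mooc_events <- data.frame(
  signup_id = c(1, 1, 2, 2, 2),
  source    = c("browser", "browser", "server", "server", "browser"),
  event     = c("access", "access", "wiki", "access", "wiki"),
  stringsAsFactors = FALSE
)

# Count rows per (signup_id, source, event) combination
tst <- mooc_events %>%
  count(signup_id, source, event, name = "count") %>%
  as.data.frame()

# Reshape long -> wide; fill = 0 replaces the NA of missing combinations
wide <- dcast(tst, signup_id ~ source + event,
              value.var = "count", fill = 0)

print(wide)
```

Note that this still produces one column per source/event combination, not one column per single value as the question asks for.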