This problem can probably be solved with data.table or dplyr. I have a dataset (a data frame) that looks like this:
summary(mooc_events)
   signup_id           time                source              event
 Min.   :     1   Min.   :2013-10-27   browser:3869940   access    :3112191
 1st Qu.: 18721   1st Qu.:2013-12-19   server :4287337   discussion: 649259
 Median : 48331   Median :2014-05-30                     navigate  :1009309
 Mean   : 63476   Mean   :2014-04-05                     page_close:1237883
 3rd Qu.:110375   3rd Qu.:2014-06-15                     problem   :1261170
 Max.   :200905   Max.   :2014-08-01                     video     : 796958
                                                         wiki      :  90507
 artefact_sha
 Length:8157277
 Class :character
 Mode  :character
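A single signup_id has multiple events, so many rows start with the same signup_id.
What I want is an aggregated dataset (a data.table or a data frame) with one column for every distinct value of each given column, all grouped by signup_id, so for this data it would look like this: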
signup_id, source_browser, source_server, event_access, event_discussion, ... , event_wiki, artefact_sha_{first_element_in_whole_dataset}, ..., artefact_sha_{last_element_in_whole_dataset}
1, 23, 37, 9, 0, ..., 3, 7, ..., 1
2, 2, 7, 2, 2, ..., 1, 0, ..., 0
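In other words, it counts the occurrences of each value in a given set of columns, grouped by the single column signup_id. I am not interested in grouping by, for example, signup_id and source together.
The column naming is not strict (the _ can be replaced by anything that makes sense).
(Let's skip the time column for now.)
Best regards and thanks.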
Answer 0 (score: 3)
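This is more of a reshaping problem, which can be solved with the tidyr and reshape2 libraries.
Reshape with tidyr, then count the occurrences with reshape2.
My example does not include artefact_sha, because I don't understand what you want to do with it.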
library(dplyr)    # or library(magrittr), just for the pipe syntax
library(tidyr)
library(reshape2)

set.seed(42)
mooc_events <- data.frame(signup_id = rep(1:3, each = 5),
                          time = Sys.Date(),
                          source = sample(c("browser", "server"), 15, replace = TRUE),
                          event = sample(c("access", "discussion", "navigate"), 15, replace = TRUE),
                          stringsAsFactors = FALSE)

# Stack source and event into key/value pairs, then fuse each pair
# into a single label such as "source_browser"
mooc_events.m <-
  mooc_events %>%
  gather(key, value, -c(signup_id, time)) %>%
  unite(var, key, value, sep = "_")

# One row per signup_id, one column per label, each cell counting occurrences
myTable <- dcast(mooc_events.m, signup_id ~ var, fun.aggregate = length)
> myTable
  signup_id event_access event_discussion event_navigate source_browser source_server
1         1            1                2              2              1             4
2         2            2                0              3              1             4
3         3            0                3              2              3             2
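Since the question mentions data.table, the same melt-then-cast idea can probably be written with data.table alone. What follows is only a minimal sketch on made-up toy data (the six rows and the names dt, long and wide are assumptions for illustration, not the real mooc_events), and it has not been tried on the full 8-million-row dataset.

```r
library(data.table)

# Toy stand-in for mooc_events (hypothetical values, for illustration only)
dt <- data.table(
  signup_id = c(1L, 1L, 1L, 2L, 2L, 3L),
  source    = c("browser", "server", "server", "browser", "browser", "server"),
  event     = c("access", "access", "wiki", "video", "video", "access")
)

# Long format: one row per (signup_id, column name, column value)
long <- melt(dt, id.vars = "signup_id", measure.vars = c("source", "event"))

# Wide format: one column per "<column name>_<value>" pair;
# length() counts the rows that fall into each cell
wide <- dcast(long, signup_id ~ variable + value,
              fun.aggregate = length, value.var = "value")
print(wide)
```

To count more columns, it should be enough to add their names to measure.vars, provided they are all of the same type (character here).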
Answer 1 (score: 1)
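Maybe this will work. It is a combination of dplyr and reshape2. This only generates some of the variables. To include the rest of the variables you want counted, just add them to the group_by call and to dcast, i.e. dcast(tst, signup_id ~ source+event+...)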
library(dplyr)
library(reshape2)

## First get counts for groupings of variables
tst <- mooc_events %>%
  group_by(signup_id, source, event) %>%
  dplyr::summarise(count = n())

## Then reshape data long -> wide
dcast(tst, signup_id ~ source + event, value.var = "count")
#   signup_id browser_access browser_navigate browser_video browser_wiki
# 1         1              2               NA             1            2
# 2         2             NA               NA             2           NA
# 3         3              3                1            NA            3
# ...
#   server_access server_navigate server_video server_wiki
# 1            NA               1            3           1
# 2             3               2           NA           1
# 3            NA               4           NA           5
## Some sample data (create this before running the code above)
mooc_events <- data.frame(
  signup_id = sample(1:10, 100, replace = TRUE),
  source = factor(sample(c("browser", "server"), 100, replace = TRUE)),
  event = factor(sample(c("access", "navigate", "video", "wiki"), 100, replace = TRUE))
)
head(mooc_events)
#   signup_id  source    event
# 1         5 browser     wiki
# 2         4  server navigate
# 3         1 browser navigate
# 4         7 browser   access
# 5         8  server   access
# 6         5 browser     wiki
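Note that the dcast output above shows NA where a signup_id never had a given source/event combination, while the question expects 0 there. One possible fix, sketched here on a tiny made-up data frame (the four rows and the name wide are assumptions for illustration), is the fill argument of reshape2's dcast:

```r
library(dplyr)
library(reshape2)

# Tiny stand-in for mooc_events (hypothetical values, for illustration only)
mooc_events <- data.frame(
  signup_id = c(1, 1, 1, 2),
  source    = c("browser", "browser", "server", "server"),
  event     = c("access", "access", "wiki", "access"),
  stringsAsFactors = FALSE
)

## Counts per signup_id / source / event, as in the answer
tst <- mooc_events %>%
  group_by(signup_id, source, event) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  as.data.frame()

## Long -> wide; fill = 0 puts 0 instead of NA into combinations that never occur
wide <- dcast(tst, signup_id ~ source + event, value.var = "count", fill = 0)
print(wide)
```

Keep in mind that this approach counts source/event combinations (browser_access, server_wiki, ...) rather than each column separately, so it answers a slightly different question than one column per distinct value.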