Sorry for my English; I need some help.
Using this dataset:
+--------+------------+---------+---------+----------+
| PEOPLE | DATE | EVENT_A | EVENT_B | EVENT_C |
+--------+------------+---------+---------+----------+
| MIKE | 04/08/2013 | 1 | 1 | 1 |
| PETE | 10/08/2013 | 1 | 0 | 1 |
| PETE | 25/08/2013 | 1 | 0 | 1 |
| PETE | 15/09/2013 | 1 | 0 | 1 |
| MIKE | 28/09/2013 | 1 | 1 | 1 |
| PETE | 19/10/2013 | 1 | 1 | 1 |
| MIKE | 30/10/2013 | 0 | 1 | 1 |
| MIKE | 09/11/2013 | 1 | 1 | 1 |
+--------+------------+---------+---------+----------+
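Basically, I need to count, for each combination of n events, the rows in which all of those events have the value 1. I don't know which approach to take in R. The output should look like this: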
+-------+-------+------------------------+---------+---------+--------+
| #MIKE | #PETE | #N EVENTS COMBINATIONS | | | |
+-------+-------+------------------------+---------+---------+--------+
| 3 | 1 | COMBINATIONS WITH 2 | EVENT A | EVENT B | |
| 2 | 4 | COMBINATIONS WITH 2 | EVENT A | EVENT C | |
| 4 | 1 | COMBINATIONS WITH 2 | EVENT B | EVENT C | |
| 3 | 2 | COMBINATIONS WITH 3 | EVENT A | EVENT B | EVENT C|
+-------+-------+------------------------+---------+---------+--------+
I need this for each person and for any number of unique events (columns).
Thanks in advance, Vince.
Answer 0 (score: 0)
One option is to use dplyr, the pipe operator (%>%), and tidyr.
Given your data, I would solve your problem like this:
library(dplyr) # for data manipulation and piping
library(tidyr) # for data reshaping
# 1. create the data
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(
  people = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
  event_a = c(rep(1, 6), 0, 1),
  event_b = c(1, 0, 0, 0, rep(1, 4)),
  event_c = c(rep(1, 8))
)
# create a dummy var for each event-combination
df2 <- df %>%
  mutate(ab = event_a & event_b,
         ac = event_a & event_c,
         bc = event_b & event_c,
         abc = event_a & event_b & event_c)
# reshape data to the long format using tidyr::gather
df3 <- df2 %>%
  # we don't need the original events anymore -> deselect them
  select(-contains("event")) %>%
  # reshape from wide to long
  gather("var", "value", -people) %>%
  # keep only the positive matches
  filter(value)
df3 %>%
  # for each combination ...
  group_by(var) %>%
  # ... count the number of cases per person
  summarise(n_mike = sum(people == "Mike"),
            n_pete = sum(people == "Pete")) %>%
  # create the text variable
  mutate(event_combs = sprintf("Combinations with %d", nchar(var))) %>%
  # reorder the columns to match your format
  select(n_mike, n_pete, event_combs, var)
#> # A tibble: 4 x 4
#> n_mike n_pete event_combs var
#> <int> <int> <chr> <chr>
#> 1 3 1 Combinations with 2 ab
#> 2 3 1 Combinations with 3 abc
#> 3 3 4 Combinations with 2 ac
#> 4 4 1 Combinations with 2 bc
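The summarise() step above hardcodes one column per person, while the question asks for counts for each person. A minimal sketch of one way around that, assuming tidyr 1.1.0 or later (for pivot_longer(), pivot_wider() and a scalar values_fill): count per combination and person, then spread the people into columns. The name counts and the columns Mike and Pete are illustrative and follow from the example data; they are not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Same example data as in the answer
df <- tibble(
  people  = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
  event_a = c(rep(1, 6), 0, 1),
  event_b = c(1, 0, 0, 0, rep(1, 4)),
  event_c = rep(1, 8)
)

counts <- df %>%
  # one logical dummy per event combination, as in the answer
  mutate(ab  = event_a & event_b,
         ac  = event_a & event_c,
         bc  = event_b & event_c,
         abc = event_a & event_b & event_c) %>%
  select(-starts_with("event")) %>%
  # wide -> long: one row per (person row, combination)
  pivot_longer(-people, names_to = "var", values_to = "value") %>%
  filter(value) %>%
  # count matches per combination and person, whoever the people are
  count(var, people) %>%
  # long -> wide: one column per person, 0 where a person has no match
  pivot_wider(names_from = people, values_from = n, values_fill = 0L)

print(counts)
```

With this layout, adding a third person to the data produces a third column without any change to the code.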
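To generalize this to an arbitrary number of events (up to 26 as long as we use letters; extending beyond that is left as an exercise), we can use expand.grid() to generate all possible event combinations and then use apply to check each combination against the data.
The code looks like this: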
# tibble() replaces the deprecated data_frame()
df <- tibble(
  people = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
  event_a = c(rep(1, 6), 0, 1),
  event_b = c(1, 0, 0, 0, rep(1, 4)),
  event_c = c(rep(1, 8)),
  event_d = c(1, 1, 0, 0, 0, 1, 0, 1)
)
# take only the events
df_events <- df %>% select(starts_with("event"))
# create all possible event combinations
# also: discard the first row (all zeros, i.e. no event selected)
event_combs <- expand.grid(rep(list(0:1), ncol(df_events)))[-1, ]
# 'loop' over the possible combinations, and find the matches
res_list <- apply(event_combs, 1, function(row) {
  # row now contains which events we choose
  row <- as.logical(row)
  # var now contains the names of the events, e.g. 'a', 'abc', or 'bc'
  var <- paste(letters[seq_along(row)][row], collapse = "")
  # combine the data into a tibble
  tibble(var = var,
         people = df$people,
         # check per row whether all selected events are 1;
         # drop = FALSE keeps a single selected column two-dimensional
         value = rowSums(df_events[, row, drop = FALSE]) == sum(row))
})
# bind the results together and keep only the positive matches
df3 <- bind_rows(res_list) %>% filter(value)
# same as before...
df3 %>%
  group_by(var) %>%
  summarise(n_mike = sum(people == "Mike"),
            n_pete = sum(people == "Pete"))
#> # A tibble: 15 x 3
#> var n_mike n_pete
#> <chr> <int> <int>
#> 1 a 3 4
#> 2 ab 3 1
#> 3 abc 3 1
#> 4 abcd 2 1
#> 5 abd 2 1
#> 6 ac 3 4
#> 7 acd 2 2
#> 8 ad 2 2
#> 9 b 4 1
#> 10 bc 4 1
#> 11 bcd 2 1
#> 12 bd 2 1
#> 13 c 4 4
#> 14 cd 2 2
#> 15 d 2 2
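The 26-event limit comes from labelling combinations with letters. As a possible take on the exercise left open above, here is a minimal base R sketch that swaps expand.grid() for combn() and labels each combination with the column names themselves, so the number of events is no longer tied to the alphabet. The names event_cols, persons, combos and result are illustrative, and the n_ prefix on the count columns is an assumption chosen to mirror n_mike and n_pete.

```r
# Same four-event example data as above, as a plain data.frame
df <- data.frame(
  people  = c("Mike", "Pete", "Pete", "Pete", "Mike", "Pete", "Mike", "Mike"),
  event_a = c(rep(1, 6), 0, 1),
  event_b = c(1, 0, 0, 0, rep(1, 4)),
  event_c = rep(1, 8),
  event_d = c(1, 1, 0, 0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

event_cols <- grep("^event_", names(df), value = TRUE)
persons <- factor(df$people)

# All non-empty subsets of the event columns, as a flat list of name vectors
combos <- unlist(
  lapply(seq_along(event_cols),
         function(k) combn(event_cols, k, simplify = FALSE)),
  recursive = FALSE
)

rows <- lapply(combos, function(combo) {
  # TRUE for rows in which every selected event equals 1
  hit <- rowSums(df[, combo, drop = FALSE] == 1) == length(combo)
  # table() on a factor keeps zero counts for people without a match
  counts <- table(persons[hit])
  out <- data.frame(
    n_events = length(combo),
    events   = paste(sub("^event_", "", combo), collapse = "+"),
    stringsAsFactors = FALSE
  )
  out[paste0("n_", tolower(levels(persons)))] <- as.list(as.vector(counts))
  out
})

result <- do.call(rbind, rows)
print(result)
```

The n_events column plays the role of the "COMBINATIONS WITH n" label in the question, so result[result$n_events >= 2, ] gives only the multi-event combinations. Note that the number of subsets doubles with every added event, so this approach is only practical for a moderate number of columns.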