How to count recurring events in an R data frame

Time: 2019-06-21 02:38:41

Tags: r dataframe

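I have a data frame dt with thousands of recurring events, each of which may have occurred at only one location or at both locations. How can I count only the events that occurred at both locations? For example, in the sample dt below, 2 events (ev2 and ev3) occurred at both the Higher and the Lower location, so the count is 2.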

dt<-structure(list(event = c("ev1", "ev1", "ev2", "ev2", "ev2", "ev2", 
"ev2", "ev3", "ev3", "ev3", "ev3", "ev3", "ev3", "ev3", "ev3", 
"ev3", "ev3", "ev3", "ev3", "ev6", "ev6", "ev6", "ev6", "ev6", 
"ev8", "ev8", "ev8", "ev11", "ev11", "ev17"), location = c("Lower", 
"Lower", "Lower", "Lower", "Higher", "Higher", "Higher", "Lower", 
"Higher", "Higher", "Lower", "Lower", "Lower", "Lower", "Lower", 
"Lower", "Lower", "Lower", "Lower", "Lower", "Lower", "Lower", 
"Lower", "Lower", "Higher", "Higher", "Higher", "Lower", "Lower", 
"Lower")), .Names = c("event", "location"), row.names = c(NA, 
-30L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(
    cols = structure(list(event = structure(list(), class = c("collector_character", 
    "collector")), location = structure(list(), class = c("collector_character", 
    "collector"))), .Names = c("event", "location")), default = structure(list(), class = c("collector_guess", 
    "collector"))), .Names = c("cols", "default"), class = "col_spec")) 

4 answers:

Answer 0 (score: 1)

We can find the events for which both locations occur:

library(dplyr)

dt %>%
  group_by(event) %>%
  filter(all(c("Lower", "Higher") %in% location)) %>%
  pull(event) %>% unique()

#[1] "ev2" "ev3"

Or, if you want the count:

dt %>%
  group_by(event) %>%
  filter(all(c("Lower", "Higher") %in% location)) %>%
  pull(event) %>% n_distinct()
#[1] 2

In base R, we can use aggregate:

df1 <- aggregate(location~event, dt, function(x) all(c("Lower", "Higher") %in% x))

df1$event[df1$location]
#[1] "ev2" "ev3"

length(df1$event[df1$location])
#[1] 2
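The test used inside filter() and aggregate() above can be tried on plain vectors to see what it does for one event's locations. This is only an illustrative sketch; the name both and the two sample vectors are made up for the example:

```r
# The two locations an event must have been seen at
both <- c("Lower", "Higher")

# %in% checks each element of `both` against the group's locations;
# all() is TRUE only when every one of them is present
seen_at_both <- all(both %in% c("Lower", "Lower", "Higher"))
seen_at_one  <- all(both %in% c("Lower", "Lower"))

seen_at_both
#[1] TRUE
seen_at_one
#[1] FALSE
```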

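Answer 1 (score: 1)

Ronak's approach is more robust, lol, but you can also drop the rows that are duplicated on both columns and then look for duplicates in the event column. This works as long as location takes only two distinct values: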

temp_df <- dt[!duplicated(dt[c("event","location")]),]
sum(duplicated(temp_df$event))
#[1] 2
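A small sketch of why the two steps work, and of the two-location assumption they rely on. The data frame toy and the extra "Middle" location are hypothetical and not part of the question's data:

```r
# Toy data (hypothetical): "a" occurs at both locations, "b" and "c" at one
toy <- data.frame(
  event    = c("a", "a", "a", "b", "b", "c"),
  location = c("Lower", "Lower", "Higher", "Lower", "Lower", "Higher"),
  stringsAsFactors = FALSE
)

# Step 1: keep one row per distinct (event, location) pair
pairs <- toy[!duplicated(toy[c("event", "location")]), ]

# Step 2: an event that still appears twice was seen at both locations
both_count <- sum(duplicated(pairs$event))
both_count
#[1] 1

# With a third location, the same event is counted once per extra pair
toy3   <- rbind(toy, data.frame(event = "a", location = "Middle",
                                stringsAsFactors = FALSE))
pairs3 <- toy3[!duplicated(toy3[c("event", "location")]), ]
over_count <- sum(duplicated(pairs3$event))
over_count
#[1] 2
```

If more than two locations are possible, the group-wise all(... %in% ...) test is the safer choice.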

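Answer 2 (score: 0)

You can also paste each unique row together into one string and use regexpr to extract the event number, then count the numbers that appear more than once.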

sum(table(regmatches(v <- unique(apply(dt, 1, paste, collapse="")), regexpr("\\d+", v))) > 1)
# [1] 2
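The one-liner is dense, so here it is split into steps on a small hand-built data frame. The data frame small is a hypothetical subset written out so that the snippet runs on its own. The approach assumes every event name contains a number that identifies it and that event is the first column:

```r
# Hypothetical subset of distinct (event, location) rows
small <- data.frame(
  event    = c("ev1", "ev2", "ev2", "ev3", "ev3", "ev11"),
  location = c("Lower", "Lower", "Higher", "Lower", "Higher", "Lower"),
  stringsAsFactors = FALSE
)

# 1. Paste each row into one string, e.g. "ev2Lower", and drop repeats
v <- unique(apply(small, 1, paste, collapse = ""))

# 2. Extract the first run of digits in each string (the event number)
ids <- regmatches(v, regexpr("\\d+", v))

# 3. Tabulate the numbers; a count above 1 means two distinct rows,
#    i.e. the event was seen at both locations
n_both <- sum(table(ids) > 1)
n_both
#[1] 2
```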

Answer 3 (score: 0)

We can use data.table:

library(data.table)
nrow(setDT(dt)[, .GRP[sum(c("Lower", "Higher") %in% location) == 2], event])
#[1] 2

Or with dplyr:

library(dplyr)
dt %>%
    filter(location %in% c("Lower", "Higher")) %>% 
    distinct %>% 
    count(event) %>% 
    filter(n == 2) %>% 
    nrow
#[1] 2

Or using base R:

sum(rowSums(table(unique(dt))) == 2)
#[1] 2