Subsetting a data frame based on the running sum of rows in a given column

Time: 2018-04-21 18:45:56

Tags: r

I am working with data containing three variables (id, time, gender). It looks like this:

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
      time = c(21L, 3L, 4L, 9L, 5L, 9L, 10L, 6L, 27L, 3L, 4L, 10L),
      gender = c(1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-12L)
  )

That is, each id has four observations of time and gender. I want to subset these data in R based on the running sum of the time variable: for each id, keep rows up to and including the first row at which the cumulative sum of time is greater than or equal to 25. Note that for id 2 all observations are included, while for id 3 only the first observation is. The expected result is:

df <-
  structure(
    list(
      id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L ),
      time = c(21L, 3L, 4L, 5L, 9L, 10L, 6L, 27L ),
      gender = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L)
    ),
    .Names = c("id", "time", "gender"),
    class = "data.frame",
    row.names = c(NA,-8L)
  )
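For reference, the per-id running totals of time make the rule concrete; a quick base R check on the df defined above (purely illustrative, not part of the original question):

# cumulative sum of time within each id
ave(df$time, df$id, FUN = cumsum)
# [1] 21 24 28 37  5 14 24 30 27 30 34 44
# id 1 first reaches 25 at its 3rd row, id 2 at its 4th, id 3 at its 1st,
# which is why 3 + 4 + 1 = 8 rows remain in the expected output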

Any help with this is greatly appreciated.

3 Answers:

Answer 0 (score: 1)

You can try a dplyr pipeline:

library(dplyr)

dt <- group_by(df, id) %>%
  # running sum of time within each group
  mutate(sum_time = cumsum(time)) %>%
  # keep rows whose running total, before adding the current row, is still below 25
  filter(lag(sum_time, default = 0) < 25) %>%
  # drop the helper column from the result
  select(-sum_time)
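Assuming the pipeline above has been run on the example df, a quick sanity check:

nrow(dt)       # 8, matching the expected result
table(dt$id)   # 3 rows for id 1, 4 for id 2, 1 for id 3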

Answer 1 (score: 1)

One option is to filter on a lagged cumsum:

library(dplyr)

df %>% group_by(id,gender) %>%
  filter(lag(cumsum(time), default = 0) < 25 )

# # A tibble: 8 x 3
# # Groups: id, gender [3]
# id  time gender
# <int> <int>  <int>
# 1     1    21      1
# 2     1     3      1
# 3     1     4      1
# 4     2     5      0
# 5     2     9      0
# 6     2    10      0
# 7     2     6      0
# 8     3    27      1

Using data.table (updated based on feedback from @Renu):

library(data.table)

setDT(df)

df[,.SD[shift(cumsum(time), fill = 0) < 25], by=.(id,gender)]
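To see what the lagged cumulative sum does, here is a small illustration on id 1's time values (not part of the original answer; shift() is the data.table lag used above):

x <- c(21L, 3L, 4L, 9L)            # time values for id 1
cumsum(x)                          # 21 24 28 37
shift(cumsum(x), fill = 0)         # 0 21 24 28 -- the running total *before* each row
shift(cumsum(x), fill = 0) < 25    # TRUE TRUE TRUE FALSE -- the row crossing 25 is kept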

Answer 2 (score: 1)

Another option is to create a logical vector that is TRUE whenever 'cumsum(time) >= 25', i.e. once the running total of time for an id has reached 25 or more.

You can then filter for rows where the cumsum of this logical vector is less than or equal to 1, which keeps everything up to and including the first TRUE for each 'id' (a short worked illustration follows the output below):

df %>% 
 group_by(id) %>% 
 filter(cumsum( cumsum(time) >= 25 ) <= 1)
# A tibble: 8 x 3
# Groups:   id [3]
#      id  time gender
#   <int> <int>  <int>
# 1     1    21      1
# 2     1     3      1
# 3     1     4      1
# 4     2     5      0
# 5     2     9      0
# 6     2    10      0
# 7     2     6      0
# 8     3    27      1
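To make the doubled cumsum explicit, here is the same trick applied by hand to id 1's time values (illustrative only):

x <- c(21, 3, 4, 9)         # time values for id 1
cumsum(x) >= 25             # FALSE FALSE TRUE TRUE
cumsum(cumsum(x) >= 25)     # 0 0 1 2
# filtering on <= 1 keeps the rows before the threshold plus the first row
# that crosses it, and drops everything after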