Filter rows within a group that precede a given row's date

Time: 2014-11-04 11:23:41

Tags: r

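I have a data frame in which each row is an event associated with a date. Within each location (locid) there is one index event and a series of matched events that may occur before and/or after the index event. For each location, I need to subset all matched events that occurred before the index event. The data are structured as follows.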

locid    match      date          score      iid
1        index      4/11/2013      15        1
1        matched    1/09/2013      23        2
1        matched    14/04/2013      1        3
1        matched    7/1/2014       21        4
2        index      2/4/2013       12        1
2        matched    1/2/2013       10        2
3        index      1/5/2013       23        1
3        matched    2/5/2013       10        2
4        index      3/3/2013        9        1
4        matched    10/2/2013      32        2
4        matched    1/10/2012      15        3
4        matched    4/3/2013       12        4
4        matched    10/3/2013      10        5

I need to subset the data frame so that I end up with only the rows whose date is earlier than the index event date for their location:

locid    match      date          score      iid
1        matched    1/09/2013      23        2
1        matched    14/04/2013      1        3
2        matched    1/2/2013       10        2
4        matched    10/2/2013      32        2
4        matched    1/10/2012      15        3

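This is my first question here, so I hope I'm not going about it the wrong way. I have tried various approaches in R, but I'm struggling to find one that works.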

2 answers:

Answer 0: (score: 5)

Here is one possibility with data.table (assuming your data is called df):

library(data.table)
setDT(df)[, date := as.Date(date, format = "%d/%m/%Y")][, 
           .SD[date < date[match == "index"]], by = locid]
#    locid   match       date score iid
# 1:     1 matched 2013-09-01    23   2
# 2:     1 matched 2013-04-14     1   3
# 3:     2 matched 2013-02-01    10   2
# 4:     4 matched 2013-02-10    32   2
# 5:     4 matched 2012-10-01    15   3

A possible base R solution:

df <- transform(df, date = as.Date(date, format = "%d/%m/%Y"))
do.call(rbind, by(df, df$locid, FUN = function(x) x[with(x, date < date[match == "index"]), ]))
#      locid   match       date score iid
# 1.2      1 matched 2013-09-01    23   2
# 1.3      1 matched 2013-04-14     1   3
# 2        2 matched 2013-02-01    10   2
# 4.10     4 matched 2013-02-10    32   2
# 4.11     4 matched 2012-10-01    15   3

Another possible base R solution:

df <- transform(df, date = as.Date(date, format = "%d/%m/%Y"))
do.call(rbind, lapply(split(df, df$locid), function(x) x[with(x, date < date[match == "index"]), ]))
#      locid   match       date score iid
# 1.2      1 matched 2013-09-01    23   2
# 1.3      1 matched 2013-04-14     1   3
# 2        2 matched 2013-02-01    10   2
# 4.10     4 matched 2013-02-10    32   2
# 4.11     4 matched 2012-10-01    15   3

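The basic idea here is to convert your date column to class Date so that R can order it chronologically. After that, we split the data by locid and apply a filter function to each chunk; the function keeps only the rows whose date is earlier than the date of the row where match == "index".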
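The steps described above can be unrolled as follows. This is only a minimal sketch on a toy excerpt of the question's data (locations 3 and 4); the names toy, chunks and keep_before_index are made up for illustration and are not part of the original answer. It also shows why the Date conversion is needed: compared as text, the dates are ordered character by character.

```r
# Toy excerpt of the question's data (locations 3 and 4 only); illustrative.
toy <- data.frame(
  locid = c(3L, 3L, 4L, 4L, 4L),
  match = c("index", "matched", "index", "matched", "matched"),
  date  = c("1/5/2013", "2/5/2013", "3/3/2013", "10/2/2013", "10/3/2013"),
  stringsAsFactors = FALSE
)

# Pitfall: as text, "10/3/2013" sorts before "3/3/2013" because "1" < "3",
# although 10 March comes after 3 March.
text_order_wrong <- "10/3/2013" < "3/3/2013"

# Step 1: convert to Date so that `<` compares chronologically.
toy$date <- as.Date(toy$date, format = "%d/%m/%Y")

# Step 2: split into one data frame per location.
chunks <- split(toy, toy$locid)

# Step 3: within each chunk, keep rows dated before that chunk's index event.
# This assumes exactly one index row per location, as in the question.
keep_before_index <- function(x) {
  x[x$date < x$date[x$match == "index"], ]
}
filtered <- lapply(chunks, keep_before_index)

# Step 4: stack the filtered chunks back into one data frame.
result <- do.call(rbind, filtered)
print(result)
```

Location 3 contributes no rows because its only matched event falls after the index event, and rbind simply drops the empty chunk.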

Answer 1: (score: 3)

Here is a way to do it with dplyr:

require(dplyr)
df %>%
  mutate(date = as.Date(date, format = "%d/%m/%Y")) %>%
  group_by(locid) %>%
  filter(match == "matched" & date < date[match == "index"])

#Source: local data frame [5 x 5]
#Groups: locid
#
#  locid   match       date score iid
#1     1 matched 2013-09-01    23   2
#2     1 matched 2013-04-14     1   3
#3     2 matched 2013-02-01    10   2
#4     4 matched 2013-02-10    32   2
#5     4 matched 2012-10-01    15   3

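First the dates are converted to a real Date format, then the data are grouped by the column locid, and finally the filter keeps all rows where match == "matched" and date is before the index date of the group.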

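Note: strictly speaking, you could drop match == "matched" from the filter arguments, because you are already keeping only rows with a date < the index date (so an index row can never be included). I'm leaving it in for now because I find it easier to read, and because if you change the condition to, for example, <=, you will need match == "matched" to keep the index rows out.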
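A small sketch of the point made in the note, written in base R for a single location so that it runs without extra packages; the same logical conditions would apply inside filter(). The data frame one_loc is hypothetical, and its third row (a matched event on the same day as the index event) does not occur in the question's data.

```r
# Hypothetical single location; the third row shares the index event's date.
one_loc <- data.frame(
  match = c("index", "matched", "matched"),
  date  = as.Date(c("2013-04-02", "2013-02-01", "2013-04-02")),
  stringsAsFactors = FALSE
)
index_date <- one_loc$date[one_loc$match == "index"]

# Strict `<`: the index row can never satisfy date < its own date.
strict <- one_loc[one_loc$date < index_date, ]

# `<=` alone: the index row satisfies the condition and is kept.
loose <- one_loc[one_loc$date <= index_date, ]

# `<=` combined with match == "matched": the index row is excluded again.
loose_matched <- one_loc[one_loc$match == "matched" & one_loc$date <= index_date, ]

print(strict)
print(loose)
print(loose_matched)
```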

Data:

df <- structure(list(locid = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 
4L, 4L, 4L, 4L), match = structure(c(1L, 2L, 2L, 2L, 1L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("index", "matched"), class = "factor"), 
    date = structure(c(11L, 1L, 7L, 13L, 8L, 3L, 4L, 9L, 10L, 
    5L, 2L, 12L, 6L), .Label = c("1/09/2013", "1/10/2012", "1/2/2013", 
    "1/5/2013", "10/2/2013", "10/3/2013", "14/04/2013", "2/4/2013", 
    "2/5/2013", "3/3/2013", "4/11/2013", "4/3/2013", "7/1/2014"
    ), class = "factor"), score = c(15L, 23L, 1L, 21L, 12L, 10L, 
    23L, 10L, 9L, 32L, 15L, 12L, 10L), iid = c(1L, 2L, 3L, 4L, 
    1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 5L)), .Names = c("locid", 
"match", "date", "score", "iid"), class = "data.frame", row.names = c(NA, 
-13L))
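If the dput() output above is awkward to paste, the same table can probably be rebuilt from the text in the question. This is a sketch, not part of the original answer: the name df_text is an assumption, and the match and date columns come out as character vectors rather than the factors used above. The comparison match == "index" and the as.Date() conversion work on either type.

```r
# Rebuild the question's table from plain text (columns split on whitespace).
df_text <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
locid match   date       score iid
1     index   4/11/2013  15    1
1     matched 1/09/2013  23    2
1     matched 14/04/2013 1     3
1     matched 7/1/2014   21    4
2     index   2/4/2013   12    1
2     matched 1/2/2013   10    2
3     index   1/5/2013   23    1
3     matched 2/5/2013   10    2
4     index   3/3/2013   9     1
4     matched 10/2/2013  32    2
4     matched 1/10/2012  15    3
4     matched 4/3/2013   12    4
4     matched 10/3/2013  10    5
")

# Check that every date parses as day/month/year before using the data.
parsed <- as.Date(df_text$date, format = "%d/%m/%Y")
str(df_text)
```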