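I have a data frame in which each row is an event with an associated date. Within each location I have one index event and a series of matched events that may occur before and/or after the index event. For each location, I need to subset all matched events that occurred before the index event. The data are structured as follows.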
locid match date score iid
1 index 4/11/2013 15 1
1 matched 1/09/2013 23 2
1 matched 14/04/2013 1 3
1 matched 7/1/2014 21 4
2 index 2/4/2013 12 1
2 matched 1/2/2013 10 2
3 index 1/5/2013 23 1
3 matched 2/5/2013 10 2
4 index 3/3/2013 9 1
4 matched 10/2/2013 32 2
4 matched 1/10/2012 15 3
4 matched 4/3/2013 12 4
4 matched 10/3/2013 10 5
I need to subset the data frame so that I end up with only the rows whose date is earlier than the index event date for each location:
locid match date score iid
1 matched 1/09/2013 23 2
1 matched 14/04/2013 1 3
2 matched 1/2/2013 10 2
4 matched 10/2/2013 32 2
4 matched 1/10/2012 15 3
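This is my first question here, so I hope I'm not going about this the wrong way. I have tried various approaches in R, but I am struggling to find one that works.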
Answer 0 (score: 5)
Here is a possibility with data.table (assuming your data is named df)
library(data.table)
setDT(df)[, date := as.Date(date, format = "%d/%m/%Y")][,
.SD[date < date[match == "index"]], by = locid]
# locid match date score iid
# 1: 1 matched 2013-09-01 23 2
# 2: 1 matched 2013-04-14 1 3
# 3: 2 matched 2013-02-01 10 2
# 4: 4 matched 2013-02-10 32 2
# 5: 4 matched 2012-10-01 15 3
A possible base R solution
df <- transform(df, date = as.Date(date, format = "%d/%m/%Y"))
do.call(rbind, by(df, df$locid, FUN = function(x) x[with(x, date < date[match == "index"]), ]))
# locid match date score iid
# 1.2 1 matched 2013-09-01 23 2
# 1.3 1 matched 2013-04-14 1 3
# 2 2 matched 2013-02-01 10 2
# 4.10 4 matched 2013-02-10 32 2
# 4.11 4 matched 2012-10-01 15 3
Another possible base R solution
df <- transform(df, date = as.Date(date, format = "%d/%m/%Y"))
do.call(rbind, lapply(split(df, df$locid), function(x) x[with(x, date < date[match == "index"]), ]))
# locid match date score iid
# 1.2 1 matched 2013-09-01 23 2
# 1.3 1 matched 2013-04-14 1 3
# 2 2 matched 2013-02-01 10 2
# 4.10 4 matched 2013-02-10 32 2
# 4.11 4 matched 2012-10-01 15 3
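The basic idea here is to convert your date column to Date class so that R can order it correctly. After that, we split the data by locid and apply a filter function to each chunk that keeps only the rows whose date is earlier than the date of the row where match == "index".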
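To make the two steps concrete, here is a minimal base R sketch on a small made-up data frame (toy, index_rows, index_date and before_index are names invented for this illustration, not part of the answer above). It assumes every location has exactly one index row. Instead of splitting the data, it looks up each row's index date by locid and compares row by row, which is a variation on the split-and-filter approach rather than the same code.

```r
# Toy data for illustration only: two locations, dates as d/m/Y strings
toy <- data.frame(
  locid = c(1L, 1L, 2L, 2L),
  match = c("index", "matched", "index", "matched"),
  date  = c("3/3/2013", "10/3/2013", "2/4/2013", "1/2/2013"),
  stringsAsFactors = FALSE
)

# As text, "10/3/2013" is ordered by its characters, not by the calendar,
# so the raw strings cannot be trusted for before/after comparisons.
toy$date <- as.Date(toy$date, format = "%d/%m/%Y")

# Assumption: exactly one index row per locid.
# Look up the index date that belongs to each row's location.
index_rows <- toy[toy$match == "index", ]
index_date <- index_rows$date[match(toy$locid, index_rows$locid)]

# Keep only the rows dated strictly before their location's index date
before_index <- toy[toy$date < index_date, ]
before_index
```

In this toy example only the matched row for locid 2 (1 February 2013, before the index date of 2 April 2013) is kept; the matched row for locid 1 falls a week after its index date and is dropped.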
Answer 1 (score: 3)
Here is how to do it with dplyr:
library(dplyr)
df %>%
mutate(date = as.Date(date, format = "%d/%m/%Y")) %>%
group_by(locid) %>%
filter(match == "matched" & date < date[match == "index"])
#Source: local data frame [5 x 5]
#Groups: locid
#
# locid match date score iid
#1 1 matched 2013-09-01 23 2
#2 1 matched 2013-04-14 1 3
#3 2 matched 2013-02-01 10 2
#4 4 matched 2013-02-10 32 2
#5 4 matched 2012-10-01 15 3
First convert the dates to a real Date format, then group the data by the locid column, and then keep all rows where match == "matched" and date is earlier than the index date.
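Note: strictly speaking, you could drop match == "matched" from the filter, since you are already keeping only rows whose date is < the index date (so the index row itself can never pass). I left it in for now because I find it easier to read, and because if you changed the condition to, say, <=, you would need match == "matched" to exclude the index row.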
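A small sketch of the point made in the note, using a made-up three-row data frame (toy, with_index and without_index are illustrative names, and the dates are invented so that one matched event falls on the index date itself):

```r
library(dplyr)

# Toy data for illustration only: one matched event shares the index date
toy <- data.frame(
  locid = c(1L, 1L, 1L),
  match = c("index", "matched", "matched"),
  date  = as.Date(c("2013-11-04", "2013-11-04", "2014-01-07")),
  stringsAsFactors = FALSE
)

# With <= and no condition on match, the index row passes its own test
with_index <- toy %>%
  group_by(locid) %>%
  filter(date <= date[match == "index"])

# Adding match == "matched" excludes the index row again
without_index <- toy %>%
  group_by(locid) %>%
  filter(match == "matched" & date <= date[match == "index"])

nrow(with_index)     # 2: the index row and the same-day matched row
nrow(without_index)  # 1: only the same-day matched row
```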
df <- structure(list(locid = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 4L, 4L, 4L), match = structure(c(1L, 2L, 2L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("index", "matched"), class = "factor"),
date = structure(c(11L, 1L, 7L, 13L, 8L, 3L, 4L, 9L, 10L,
5L, 2L, 12L, 6L), .Label = c("1/09/2013", "1/10/2012", "1/2/2013",
"1/5/2013", "10/2/2013", "10/3/2013", "14/04/2013", "2/4/2013",
"2/5/2013", "3/3/2013", "4/11/2013", "4/3/2013", "7/1/2014"
), class = "factor"), score = c(15L, 23L, 1L, 21L, 12L, 10L,
23L, 10L, 9L, 32L, 15L, 12L, 10L), iid = c(1L, 2L, 3L, 4L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 5L)), .Names = c("locid",
"match", "date", "score", "iid"), class = "data.frame", row.names = c(NA,
-13L))