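I have a data table of different patients ("Spell") with several temperature ("Temp") measurements ("Episode") per patient. I also have the date and time of each temperature measurement.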
Spell Episode Date Temp
1 3 2-1-17 21:00 40
1 2 2-1-17 20:00 36
1 1 1-1-17 10:00 37
2 3 2-1-17 15:00 36
2 2 2-1-17 10:00 37
2 1 1-1-17 8:00 36
3 1 3-1-17 10:00 40
4 3 4-1-17 15:00 36
4 2 3-1-17 12:00 40
4 1 3-1-17 10:00 39
5 7 3-1-17 17:30 36
5 6 2-1-17 17:00 36
5 5 2-1-17 16:00 37
5 1 1-1-17 9:00 36
5 4 1-1-17 14:00 39
5 3 1-1-17 13:00 40
5 2 1-1-17 11:00 39
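I'm interested in keeping all measurements taken within the 24 hours before each patient's last measurement. I have already grouped the observations by Spell and by reverse date, but I'm not sure how to do a within-group comparison against the same reference (in this case, the first row of each group). The result should be: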
Spell Episode Date Temp
1 3 2-1-17 21:00 40
1 2 2-1-17 20:00 36
2 3 2-1-17 15:00 36
2 2 2-1-17 10:00 37
3 1 3-1-17 10:00 40
4 3 4-1-17 15:00 36
5 7 3-1-17 17:30 36
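Any ideas that point me in the right direction would be much appreciated.
Edit: Dates are in d-m-yy H:M format. Here is the dput output of the data: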
structure(list(Spell = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), Episode = c(3L, 2L, 1L, 3L,
2L, 1L, 1L, 3L, 2L, 1L, 7L, 6L, 5L, 1L, 4L, 3L, 2L), Date = c("2-1-17 21:00",
"2-1-17 20:00", "1-1-17 10:00", "2-1-17 15:00", "2-1-17 10:00",
"1-1-17 8:00", "3-1-17 10:00", "4-1-17 15:00", "3-1-17 12:00",
"3-1-17 10:00", "3-1-17 17:30", "2-1-17 17:00", "2-1-17 16:00",
"1-1-17 9:00", "1-1-17 14:00", "1-1-17 13:00", "1-1-17 11:00"
), Temp = c(40L, 36L, 37L, 36L, 37L, 36L, 40L, 36L, 40L, 39L,
36L, 36L, 37L, 36L, 39L, 40L, 39L)), .Names = c("Spell", "Episode",
"Date", "Temp"), class = c("data.table", "data.frame"), row.names = c(NA,
-17L))
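Answer 0 (score: 6)
library(dplyr)
df %>%
  # parse the d-m-yy H:M strings (two-digit year, hence %y) into seconds
  mutate(Date2 = as.numeric(as.POSIXct(Date, format = "%d-%m-%y %H:%M", tz = "GMT"))) %>%
  group_by(Spell) %>%
  # keep rows within 24 hours (in seconds) of each Spell's latest measurement
  filter(Date2 >= max(Date2) - 60 * 60 * 24) %>%
  ungroup() %>%
  select(-Date2)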
Answer 1 (score: 5)
A solution using only data.table:
library(data.table)
# convert the Date column to POSIXct
DT[, Date := as.POSIXct(Date, format = '%d-%m-%y %H:%M', tz = 'GMT')]
# within each Spell, keep rows at most 24 hours before the latest measurement
filteredDT <- DT[, .SD[as.numeric(difftime(max(Date), Date, units = 'hours')) <= 24], by = Spell]
> filteredDT
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
3: 2 3 2017-01-02 15:00:00 36
4: 2 2 2017-01-02 10:00:00 37
5: 3 1 2017-01-03 10:00:00 40
6: 4 3 2017-01-04 15:00:00 36
7: 5 7 2017-01-03 17:30:00 36
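The within-group comparison the question asks about can also be sketched in base R, without data.table or dplyr. This is only an illustration under stated assumptions: it uses a six-row subset of the question's data (Spells 1 and 5), and the names df, secs, latest and kept are made up for the example. ave() repeats each Spell's latest timestamp on every row of that Spell, which gives each row the same reference to compare against.

```r
# Six rows copied from the question's data (Spells 1 and 5 only)
df <- data.frame(
  Spell   = c(1L, 1L, 1L, 5L, 5L, 5L),
  Episode = c(3L, 2L, 1L, 7L, 6L, 5L),
  Date    = c("2-1-17 21:00", "2-1-17 20:00", "1-1-17 10:00",
              "3-1-17 17:30", "2-1-17 17:00", "2-1-17 16:00"),
  Temp    = c(40L, 36L, 37L, 36L, 36L, 37L),
  stringsAsFactors = FALSE
)

# Parse d-m-yy H:M strings and convert to seconds since the epoch
secs <- as.numeric(as.POSIXct(df$Date, format = "%d-%m-%y %H:%M", tz = "GMT"))

# Per-row copy of the latest timestamp within the row's Spell
latest <- ave(secs, df$Spell, FUN = max)

# Keep rows no more than 24 hours (in seconds) before that latest timestamp
kept <- df[secs >= latest - 24 * 60 * 60, ]
kept
```

Because the reference is computed per row, the filter does not depend on how the rows are sorted.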
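Answer 2 (score: 2)
# convert Date to POSIXct
mydata$Date <- as.POSIXct(mydata$Date, format = '%d-%m-%y %H:%M', tz = 'GMT')
# sort by Spell, newest measurement first
mydata <- mydata[with(mydata, order(Spell, -as.numeric(Date))), ]
# per Spell, flag rows no more than one day before the latest measurement
index <- with(mydata, tapply(Date, Spell, function(x) x >= max(x) - as.difftime(1, units = "days")))
mydata[unlist(index), ]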
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
4: 2 3 2017-01-02 15:00:00 36
5: 2 2 2017-01-02 10:00:00 37
7: 3 1 2017-01-03 10:00:00 40
8: 4 3 2017-01-04 15:00:00 36
11: 5 7 2017-01-03 17:30:00 36
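Answer 3 (score: 1)
The solution below uses two functions from Hadley Wickham's lubridate package. This package is very handy when working with dates and times, so I wonder why it hasn't been used in any of the other answers.
In addition, data.table is used because the OP provided sample data of class data.table.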
library(data.table) # if not already loaded
# coerce Date to POSIXct
DT[, Date := lubridate::dmy_hm(Date)][
# for each, pick measurements within last 24 hours
, .SD[Date > max(Date) - lubridate::dhours(24L)], by = Spell][
# order, just for convenience
order(Spell, -Date)]
Spell Episode Date Temp
1: 1 3 2017-01-02 21:00:00 40
2: 1 2 2017-01-02 20:00:00 36
3: 2 3 2017-01-02 15:00:00 36
4: 2 2 2017-01-02 10:00:00 37
5: 3 1 2017-01-03 10:00:00 40
6: 4 3 2017-01-04 15:00:00 36
7: 5 7 2017-01-03 17:30:00 36
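Please note that the expected result given by the OP shows an additional row (Spell 5, Episode 6), which lies outside the 24-hour window.
Data, as provided by the OP: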
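The 24-hour arithmetic for Spell 5 can be checked directly. The sketch below uses only base R and the two timestamps from the sample data; the variable names are made up for illustration.

```r
# Latest measurement of Spell 5 (Episode 7) and the measurement in Episode 6
latest   <- as.POSIXct("3-1-17 17:30", format = "%d-%m-%y %H:%M", tz = "GMT")
episode6 <- as.POSIXct("2-1-17 17:00", format = "%d-%m-%y %H:%M", tz = "GMT")

# Gap between the two measurements, in hours
gap_hours <- as.numeric(difftime(latest, episode6, units = "hours"))
gap_hours

# TRUE only if Episode 6 falls inside the 24-hour window
within_window <- gap_hours <= 24
within_window
```

The gap is 24.5 hours, so Episode 6 is excluded whether the comparison is strict or inclusive.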
DT <- structure(list(Spell = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), Episode = c(3L, 2L, 1L, 3L,
2L, 1L, 1L, 3L, 2L, 1L, 7L, 6L, 5L, 1L, 4L, 3L, 2L), Date = c("2-1-17 21:00",
"2-1-17 20:00", "1-1-17 10:00", "2-1-17 15:00", "2-1-17 10:00",
"1-1-17 8:00", "3-1-17 10:00", "4-1-17 15:00", "3-1-17 12:00",
"3-1-17 10:00", "3-1-17 17:30", "2-1-17 17:00", "2-1-17 16:00",
"1-1-17 9:00", "1-1-17 14:00", "1-1-17 13:00", "1-1-17 11:00"
), Temp = c(40L, 36L, 37L, 36L, 37L, 36L, 40L, 36L, 40L, 39L,
36L, 36L, 37L, 36L, 39L, 40L, 39L)), .Names = c("Spell", "Episode",
"Date", "Temp"), class = c("data.table", "data.frame"), row.names = c(NA, -17L))