I am looking for a solution in R.
I have a data frame:
df <- data.frame(
  Number = rep(1:5, c(1, 1, 2, 5, 2)),
  Date = c("4/8/2010", "4/8/2010", "4/15/2010", "4/21/2010",
           "4/24/2010", "6/9/2010", "6/2/2010", "6/25/2010",
           "6/30/2010", "7/9/2010", "7/28/2010"),
  Time = c("15:00:00", "16:00:00", "10:30:00", "16:15:00",
           "11:30:00", "12:00:00", "11:00:00", "10:30:00",
           "09:07:44", "08:49:43", "08:33:55"),
  Status = c("A", NA, NA, "B", NA, "B",
             NA, NA, "C", NA, "C"),
  stringsAsFactors = FALSE)
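For each unique value in the "Number" column, how can I select the earliest and the latest date (sometimes the latest dates are the same but the times differ), and also select the last (most recent) Status?
The ideal result would be:
Thank you very much.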
Answer 0 (score: 3)
## aggregate()'s formula interface drops rows containing NA by default,
## so replace NA with a placeholder (0) first
df$Status[is.na(df$Status)] <- 0
## Get earliest and latest date time
earliest <- aggregate(cbind(Date, Time, Status) ~ Number, data=df, function(x){min(as.character(x))})
latest <- aggregate(cbind(Date, Time, Status) ~ Number, data=df, function(x){max(as.character(x))})
## merge two data frames by Number
output <- merge(earliest, latest, all=TRUE, by="Number")
## Set Status to nonzero observations
output$Status <- ifelse(output$Status.x!=0, output$Status.x, output$Status.y)
## Remove redundant last date
output$LastDate <- ifelse(output$Date.x==output$Date.y & output$Time.x==output$Time.y, "", output$Date.y)
## Remove redundant last time
output$LastTime <- ifelse(output$Date.x==output$Date.y & output$Time.x==output$Time.y, "", output$Time.y)
## Select relevant output
final <- subset(output, select=c(Number, Date.x, Time.x, LastDate, LastTime, Status))
## Rename columns
names(final)[2:3] <- c("FirstDate", "FirstTime")
## Set Status back to NA
final$Status[final$Status==0] <- NA
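The final output is similar to what you described. Note that min() and max() are applied here to each column separately and on character strings, so dates are compared lexically rather than chronologically, and the Date, Time and Status in a row need not come from the same observation (see Numbers 4 and 5):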
> final
  Number FirstDate FirstTime  LastDate LastTime Status
1      1  4/8/2010  15:00:00                         A
2      2  4/8/2010  16:00:00                      <NA>
3      3 4/15/2010  10:30:00 4/21/2010 16:15:00      B
4      4 4/24/2010  09:07:44  6/9/2010 12:00:00      C
5      5 7/28/2010  08:33:55  7/9/2010 08:49:43      C
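If chronological ordering matters, one option is to parse Date and Time into a single date-time, sort, and take the first and last row per Number. The following is a minimal base R sketch rather than the answer's original method. It assumes the dates are month/day/year, uses UTC as an arbitrary fixed time zone, and reads "last status" as the most recent non-missing Status; `summarise_group` is a hypothetical helper name introduced only for this illustration.

```r
# Sample data from the question
df <- data.frame(
  Number = rep(1:5, c(1, 1, 2, 5, 2)),
  Date = c("4/8/2010", "4/8/2010", "4/15/2010", "4/21/2010",
           "4/24/2010", "6/9/2010", "6/2/2010", "6/25/2010",
           "6/30/2010", "7/9/2010", "7/28/2010"),
  Time = c("15:00:00", "16:00:00", "10:30:00", "16:15:00",
           "11:30:00", "12:00:00", "11:00:00", "10:30:00",
           "09:07:44", "08:49:43", "08:33:55"),
  Status = c("A", NA, NA, "B", NA, "B",
             NA, NA, "C", NA, "C"),
  stringsAsFactors = FALSE)

# Combine Date and Time into one POSIXct value so comparisons are
# chronological. Assumption: month/day/year input; UTC is arbitrary.
df$DateTime <- as.POSIXct(paste(df$Date, df$Time),
                          format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Sort by Number, then by date-time, so the first and last row of each
# group are the earliest and latest observations
df <- df[order(df$Number, df$DateTime), ]

# Hypothetical helper: collapse one group to a single summary row
summarise_group <- function(g) {
  first <- g[1, ]
  last  <- g[nrow(g), ]
  # Blank the "last" columns when earliest and latest coincide
  same  <- first$DateTime == last$DateTime
  # Assumption: "last status" means the most recent non-missing Status
  known <- g$Status[!is.na(g$Status)]
  data.frame(
    Number    = first$Number,
    FirstDate = first$Date,
    FirstTime = first$Time,
    LastDate  = if (same) "" else last$Date,
    LastTime  = if (same) "" else last$Time,
    Status    = if (length(known) > 0) known[length(known)] else NA_character_,
    stringsAsFactors = FALSE)
}

final <- do.call(rbind, lapply(split(df, df$Number), summarise_group))
rownames(final) <- NULL
print(final)
```

With the sample data, Number 4 then runs from 4/24/2010 11:30:00 to 6/30/2010 09:07:44 and Number 5 from 7/9/2010 08:49:43 to 7/28/2010 08:33:55, because each Date, Time and Status is taken from a whole row instead of being minimised column by column.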