How can I find the first and last record per day within the same ID?
Sample data:
sample<- data.frame(
id=c("A","A","A","A","A","C","C","C","D","D","E","E"),
location=c("US","US","US","US","SINGAPORE","CHINA","CHINA","JAPAN","JAPAN","JAPAN","SINGAPORE","SINGAPORE"),
Date =c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013","05/03/2013","05/03/2013")
)
So far I have tried to flag the first and last records with the code below, but the result does not look right.
Code I tried:
sample$FIRST <- !duplicated(sample$id)
sample$LAST<-FALSE
sample$LAST <- c(sample$id[-nrow(sample)]==sample$id[-1],TRUE)
How can I write the code so that it produces my expected result?
Expected result:
sample<- data.frame(
id=c("A","A","A","A","A","C","C","c","D","D","E","E"),
Date=c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013","05/03/2013","05/03/2013"),
FIRST =c("TRUE","FALSE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE"),
LAST =c("FALSE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE")
)
Thanks in advance... Note: the data set is huge, so performance needs to be taken into account.
Answer 0: (score: 0)
Here is one way to do this using dplyr:
library(dplyr)
sample$Date <- as.Date(sample$Date, format = "%d/%m/%Y") # convert the day/month/year strings to dates
sample <- sample %>% # take the data.frame `sample` and store the result of the chained operations back in `sample`; %>% replaces the %.% operator of early dplyr versions
  group_by(id, Date) %>% # make one group for each combination of id and Date
  mutate(count = row_number(), # mutate adds new columns; `count` numbers the rows within each id/Date group
         FIRST = count == 1, # FIRST is TRUE for the first row of the group
         LAST = count == n()) %>% # LAST is TRUE when count equals the group size, i.e. for the last row of the group
  ungroup() %>% # drop the grouping so later operations are not applied per group
  select(-count) # the helper column is no longer needed
The output is:
# id location Date FIRST LAST
#1 A US 2013-03-03 TRUE FALSE
#2 A US 2013-03-03 FALSE TRUE
#3 A US 2013-03-04 TRUE FALSE
#4 A US 2013-03-04 FALSE TRUE
#5 A SINGAPORE 2013-03-05 TRUE TRUE
#6 C CHINA 2013-03-03 TRUE TRUE
#7 C CHINA 2013-03-04 TRUE FALSE
#8 C JAPAN 2013-03-04 FALSE TRUE <<-- the expected result in the question has a lower case "c" as id for this row; it is treated as capital C here, otherwise it would form a separate id
#9 D JAPAN 2013-03-05 TRUE TRUE
#10 D JAPAN 2013-03-06 TRUE TRUE
#11 E SINGAPORE 2013-03-05 TRUE FALSE
#12 E SINGAPORE 2013-03-05 FALSE TRUE
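
For comparison, the same flags can probably be obtained in base R without any grouping step, since `duplicated()` accepts a data frame and has a `fromLast` argument. The following is only a minimal sketch: the name `records` is chosen here for illustration, the rows are assumed to already be in the order in which "first" and "last" should be judged, and it has not been benchmarked on large data.

```r
# Sample data from the question (location column omitted for brevity)
records <- data.frame(
  id   = c("A","A","A","A","A","C","C","C","D","D","E","E"),
  Date = c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013",
           "03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013",
           "05/03/2013","05/03/2013")
)

# The id/Date pair identifies a group
key <- records[c("id", "Date")]

# A row is the first of its group if no earlier row has the same id/Date pair
records$FIRST <- !duplicated(key)

# A row is the last of its group if no later row has the same id/Date pair
records$LAST <- !duplicated(key, fromLast = TRUE)

print(records)
```

Because the comparison is done on the raw id/Date values, the Date column does not need to be converted with `as.Date` for this to work.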