R - Check the first and last record of each day for every id / loop

Time: 2014-05-22 07:06:13

Tags: r loops optimization logic

How can I find the first and last record of each day within the same ID?

Sample data:

sample <- data.frame(
  id = c("A","A","A","A","A","C","C","C","D","D","E","E"),
  location = c("US","US","US","US","SINGAPORE","CHINA","CHINA","JAPAN","JAPAN","JAPAN","SINGAPORE","SINGAPORE"),
  Date = c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013","05/03/2013","05/03/2013")
)

So far I have tried the code below to find the first and last records, but it does not seem to give the right result.

Attempted code:

sample$FIRST <- !duplicated(sample$id)
sample$LAST<-FALSE
sample$LAST <- c(sample$id[-nrow(sample)]==sample$id[-1],TRUE)

How should I write the code to get my expected result?

Expected result:

sample<- data.frame(
id=c("A","A","A","A","A","C","C","c","D","D","E","E"),
Date=c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013","05/03/2013","05/03/2013"),
FIRST =c("TRUE","FALSE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE"),
LAST =c("FALSE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","TRUE","FALSE","TRUE")
)

Thanks in advance... Note: the data set is huge, so optimization needs to be considered.

1 Answer:

Answer 0: (score: 0)

Here is one way to do this using dplyr:

sample$Date <- as.Date(sample$Date, format = "%d/%m/%Y")   # convert the date strings to Date objects

library(dplyr)

sample <- sample %>%                       # take the data.frame `sample` and store the result of the following operations back in `sample`; %>% chains several operations together (early dplyr versions spelled this operator %.%)
  group_by(id, Date) %>%                   # make one group for each combination of id and Date
  mutate(count = row_number(),             # mutate adds new columns; `count` is the running row index within each id/Date group
         FIRST = count == 1,               # FIRST is TRUE for the first row of the group
         LAST = count == n()) %>%          # LAST is TRUE when the row index equals the group size, i.e. for the last row
  ungroup()                                # drop the grouping so later operations are not applied per group

sample$count <- NULL                       # the helper column is no longer needed, so remove it

The output is:

#   id  location       Date FIRST  LAST
#1   A        US 2013-03-03  TRUE FALSE 
#2   A        US 2013-03-03 FALSE  TRUE
#3   A        US 2013-03-04  TRUE FALSE
#4   A        US 2013-03-04 FALSE  TRUE
#5   A SINGAPORE 2013-03-05  TRUE  TRUE
#6   C     CHINA 2013-03-03  TRUE  TRUE
#7   C     CHINA 2013-03-04  TRUE FALSE
#8   C     JAPAN 2013-03-04 FALSE  TRUE       <<-- this line had a lower case "c" as id in the input, I corrected it to capital C, otherwise it would produce a separate id
#9   D     JAPAN 2013-03-05  TRUE  TRUE
#10  D     JAPAN 2013-03-06  TRUE  TRUE
#11  E SINGAPORE 2013-03-05  TRUE FALSE
#12  E SINGAPORE 2013-03-05 FALSE  TRUE
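
Since the question mentions a large data set, a base R alternative that avoids the helper column may be worth trying. The following is only a minimal sketch, not part of the original answer: it assumes that "first" and "last" mean the first and last row, in the existing row order, of each id/Date combination, and it uses `duplicated()` with `fromLast = TRUE` instead of dplyr. The variable name `key` is introduced here purely for illustration, and the sample data is repeated (with the capital "C" correction) so the block runs on its own.

```r
# Self-contained copy of the question's sample data (id and Date columns only).
sample <- data.frame(
  id   = c("A","A","A","A","A","C","C","C","D","D","E","E"),
  Date = c("03/03/2013","03/03/2013","04/03/2013","04/03/2013","05/03/2013",
           "03/03/2013","04/03/2013","04/03/2013","05/03/2013","06/03/2013",
           "05/03/2013","05/03/2013"),
  stringsAsFactors = FALSE
)

# `key` (hypothetical name) holds the columns that define one "day within an id".
key <- sample[, c("id", "Date")]

# A row is FIRST if no earlier row has the same id/Date pair.
sample$FIRST <- !duplicated(key)

# A row is LAST if no later row has the same id/Date pair
# (fromLast = TRUE scans the rows from the bottom up).
sample$LAST <- !duplicated(key, fromLast = TRUE)

print(sample)
```

Because both flags depend only on the id/Date pair, the date strings do not have to be converted with `as.Date()` for this to work, and the rows do not need to be sorted beforehand. Whether it is actually faster than the dplyr version on the real data would have to be measured.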