I have a dataset that looks like this:
Date boolean
407 2006-06-01 1
408 2006-06-02 1
409 2006-06-03 1
410 2006-06-04 NA
411 2006-06-05 0
412 2006-06-06 1
413 2006-06-07 1
414 2006-06-08 0
415 2006-06-09 1
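From this, I am trying to create a new data frame that shows the dates of my runs of 1s and how long each run lasts, with the column headers: 1) start date, 2) end date, 3) length of run.
Ultimately, I want to build a data frame from the data above that looks like this: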
  Start Date   End Date Length of Run
1 2006-06-01 2006-06-03             3
2 2006-06-06 2006-06-07             2
My data contains some NAs, and I need to ignore them throughout.
Answer 0: (score: 4)
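You can do this with dplyr: use mutate to convert missing boolean values to 0, group_by to build groups of consecutive rows that share the same boolean value, filter to keep only the groups where boolean is 1 and that have more than one member, and then summarize to get the relevant summary information. (I took a couple of extra steps to drop the grouping variable at the end.)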
library(dplyr)
dat %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date"=min(as.character(Date)),
            "End Date"=max(as.character(Date)),
            "Length of Run"=n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2
Data:
dat <- read.table(text=" Date boolean
407 2006-06-01 1
408 2006-06-02 1
409 2006-06-03 1
410 2006-06-04 NA
411 2006-06-05 0
412 2006-06-06 1
413 2006-06-07 1
414 2006-06-08 0
415 2006-06-09 1", header=TRUE)
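To make the grouping step in the pipeline above more concrete, here is a minimal sketch of what cumsum(c(0, diff(boolean) != 0)) produces. The vector b is a hypothetical stand-in for the boolean column after the NA has been replaced by 0; it is not part of the original answer.

```r
# Hypothetical stand-in for the boolean column, NA already replaced by 0
b <- c(1, 1, 1, 0, 0, 1, 1, 0, 1)

# diff(b) != 0 is TRUE wherever the value differs from the previous row;
# cumsum() turns those change points into a running group label
group <- cumsum(c(0, diff(b) != 0))
group
# 0 0 0 1 1 2 2 3 4

# Why NA is replaced first: a single NA makes diff() return NA,
# and cumsum() then carries NA through every later position
with_na <- cumsum(c(0, diff(c(1, NA, 1)) != 0))
with_na
# 0 NA NA
```

Each distinct label marks one stretch of identical values, which is what lets filter and summarize work one run at a time.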
Answer 1: (score: 2)
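We can also use data.table to subset and transform the data as needed. First, we create an id column with rleid(boolean). Next, we subset the data according to the required conditions. Finally, we use the subsetted data to create start, end, and run: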
library(data.table)
setDT(dat)[,id := rleid(boolean)][
,.SD[.N > 1 & boolean == 1],id][
,.(start=Date[1],end=Date[.N], run=.N),id]
# id start end run
#1: 1 2006-06-01 2006-06-03 3
#2: 4 2006-06-06 2006-06-07 2
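For readers without data.table, the same run-length idea can be sketched in base R with rle(). This is an illustrative alternative, not part of the answer above. It assumes Date is stored as a Date vector and treats NA as 0 so that it breaks a run; the names b, r, start_idx, end_idx, keep and runs are made up for the example.

```r
# Rebuild the example data so the block is self-contained
# (assumption: Date is a Date vector rather than a factor)
dat <- data.frame(
  Date    = as.Date("2006-06-01") + 0:8,
  boolean = c(1, 1, 1, NA, 0, 1, 1, 0, 1)
)

# Treat NA as 0 so it ends a run of 1s
b <- ifelse(is.na(dat$boolean), 0, dat$boolean)

r         <- rle(b)                  # lengths and values of consecutive runs
end_idx   <- cumsum(r$lengths)       # row index where each run ends
start_idx <- end_idx - r$lengths + 1 # row index where each run starts

# Keep only runs of 1s that are longer than one row
keep <- r$values == 1 & r$lengths > 1
runs <- data.frame(
  start = dat$Date[start_idx[keep]],
  end   = dat$Date[end_idx[keep]],
  run   = r$lengths[keep]
)
runs
```

The cumulative sum of the run lengths gives the last row of every run, so the start and end dates can be looked up by index without any grouping step.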
Answer 2: (score: 1)
Another answer using base R, adapted from this answer, which uses cumsum and diff.
#Remove ineligible dates (defined by 0 or NA)
#x is assumed to be the question's data frame (called dat in the first answer)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]
x1$Date <- as.Date(x1$Date) #Convert date from factor to Date class
#Starting at 0, if the difference between eligible dates is >1 day,
# add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <- cumsum(c(0, diff(x1$Date) > 1))
# Date boolean SeriesNo
#407 2006-06-01 1 0
#408 2006-06-02 1 0
#409 2006-06-03 1 0
#412 2006-06-06 1 1
#413 2006-06-07 1 1
#415 2006-06-09 1 2
# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
x2 <- as.data.frame(as.list(
aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz)
c(Start = min(zz), End= max(zz), Run = length(zz) ))
)) #see note after this code block
#Output is in days since origin. Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End <- as.Date(x2$Date.End, origin = "1970-01-01")
# SeriesNo Date.Start Date.End Date.Run
#1 0 2006-06-01 2006-06-03 3
#2 1 2006-06-06 2006-06-07 2
#3 2 2006-06-09 2006-06-09 1
Regarding the "buggy" output from aggregate, see: Using aggregate to apply several functions on several variables in one call
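One way to avoid the matrix-valued column and the conversion back to Date is to summarise each series with split and lapply instead of aggregate. The following is only a sketch under that assumption: x1 is rebuilt here by hand to match the intermediate table shown in the answer, and the function argument g is a hypothetical name.

```r
# Rebuild the intermediate table x1 from the answer (eligible dates only)
x1 <- data.frame(
  Date = as.Date(c("2006-06-01", "2006-06-02", "2006-06-03",
                   "2006-06-06", "2006-06-07", "2006-06-09")),
  SeriesNo = c(0, 0, 0, 1, 1, 2)
)

# Summarise each series into a one-row data frame, then stack the rows;
# Start and End keep their Date class, so no origin conversion is needed
x2 <- do.call(rbind, lapply(split(x1, x1$SeriesNo), function(g) {
  data.frame(SeriesNo = g$SeriesNo[1],
             Start    = min(g$Date),
             End      = max(g$Date),
             Run      = nrow(g))
}))

# Drop single-day series to match the output requested in the question
x2_multi <- x2[x2$Run > 1, ]
x2_multi
```

Filtering on Run > 1 at the end removes the 2006-06-09 series, which the question's desired output does not include.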