以下问题大部分来自数据框的庞大规模(198240观察)。我会尽力把它分解。
目标
我想创建一个变量DURATION,这是房子病了多久。
已知
问题 我不明白如何让函数/循环同时遍历家庭和时间中的数据帧。
我知道它将是一个函数或循环,如下所示(不是在R代码中,而是在逻辑中)
IF (hdinc > 0) #a house on a certain date is sick
{ Duration = 1 AND look at hdinc_1
IF (hdinc_1 = 0 )
{ Duration = Duration + 0
AND Go onto the next date for that house.
IF hdinc_1 >0 then #if the house was sick last week
{ Duration = Duration + 1
Go to SameHouse, Week-1 and look at hdinc_1 to see if it was sick the week prior
我遇到以下问题:
我知道这真的很复杂,但我甚至无法开始提供循环来提供你们所有的示例代码。
示例数据:
dat <- structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L), survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L), .Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"), hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)), .Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
示例输出:
答案 0 :(得分:0)
我们可以将函数rle
与dplyr
结合使用来查找运行,然后删除那些运行健康的内容:
library(dplyr)
dat %>% group_by(id_casa) %>%
mutate(duration = unlist(lapply(rle(hdinc > 0)[["lengths"]], seq, from = 1))) %>%
mutate(duration = ifelse(hdinc > 0, as.numeric(duration), 0))
Source: local data frame [11 x 5]
Groups: id_casa [2]
id_casa survdate hdinc hdinc_1 duration
(int) (fctr) (dbl) (dbl) (dbl)
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4
工作原理:首先我们使用rle
找到所有运行:
rle(dat$hdinc>0)
Run Length Encoding
lengths: int [1:5] 2 3 1 1 4
values : logi [1:5] TRUE FALSE TRUE FALSE TRUE
然后,我们使用seq
从rle
的0到每个长度lapply
:
z <- unlist(lapply(rle(dat$hdinc > 0)[["lengths"]], seq, from = 1))
z
[1] 1 2 1 2 3 1 1 1 2 3 4
然后我们根据疾病或健康来过滤它:
ifelse(dat$hdinc > 0, z, 0)
[1] 1 2 0 0 0 1 0 1 2 3 4
使用dplyr
group_by
我们确保自己在每个id_casa
上运行它。
编辑:在基地:
dat$duration2 <- ifelse(dat$hdinc > 0,
unlist(by(dat, dat$id_casa, FUN = function(x) unlist(lapply(rle(x$hdinc > 0)[["lengths"]], seq, from = 1)))),
0)
答案 1 :(得分:0)
仅使用基础R:
# create sample data
sampleData <-
structure(list(id_casa = c(802L, 802L, 802L, 802L, 802L, 802L, 802L, 955L, 955L, 955L, 955L),
survdate = structure(c(3L, 10L, 5L, 1L, 2L, 4L, 11L, 6L, 7L, 8L, 9L),
.Label = c("1/11/2006", "1/18/2006", "1/19/2005", "1/25/2006", "1/4/2006", "10/13/2004", "10/20/2004", "10/27/2004", "11/3/2004", "12/28/2005", "2/1/2006" ), class = "factor"),
hdinc = c(125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159, 2.5), hdinc_1 = c(0, 125, 142.85715, 0, 0, 0, 142.85715, 0, 50, 32, 159)),
.Names = c("id_casa", "survdate", "hdinc", "hdinc_1"), class = "data.frame", row.names = c(NA, -11L))
# you must be sure the rows are already ordered, otherwise you can use something like:
#sampleData <- sampleData[order(sampleData$id_casa,sampleData$survdate),]
sampleData$Duration <-
unlist(
by(sampleData,
INDICES=sampleData$id_casa,
FUN=function(house){
tail(Reduce(f=function(prv,nxt){if(nxt == 0) 0 else (prv+nxt)},
x=as.integer(house$hdinc > 0),init=0,accumulate=TRUE),-1)
}))
> sampleData
id_casa survdate hdinc hdinc_1 Duration
1 802 1/19/2005 125.0000 0.0000 1
2 802 12/28/2005 142.8571 125.0000 2
3 802 1/4/2006 0.0000 142.8571 0
4 802 1/11/2006 0.0000 0.0000 0
5 802 1/18/2006 0.0000 0.0000 0
6 802 1/25/2006 142.8571 0.0000 1
7 802 2/1/2006 0.0000 142.8571 0
8 955 10/13/2004 50.0000 0.0000 1
9 955 10/20/2004 32.0000 50.0000 2
10 955 10/27/2004 159.0000 32.0000 3
11 955 11/3/2004 2.5000 159.0000 4