我有按主题(' id')分组的时间序列数据,这些数据会保留在特定的网站上。并有一定的阶段'在每个'时间'步骤
有时,受试者从一个站点切换到另一个站点,并可能再次回到原点。如果个人将网站来回切换(例如,从网站'到网站' b',然后返回网站' a' )和如果中间网站只有一个注册(在转换aba中,那么网站' b'将被视为' ;中间网站')和个人处于特定阶段(此处,阶段= 2)位于中间网站,然后我希望在此时间删除注册。
我的虚拟数据由四个主题组成。其中三个(主题1-3)已从站点a移至b,然后返回站点b,其中一个已从a移至b。
前两个科目都在中间站点进行了一次注册。主题1位于中间站点的第1阶段,我希望保留该注册。另一方面,对象2位于中间站点的第2阶段,应该删除该注册。主题3也在a和b之间来回移动。但是,虽然它位于中间站点b的第2阶段,但它有两个注册,并且两个注册都保留。主题4已从站点a移至b,但不再返回。因此,尽管它位于站点b的第2阶段,但站点b上的注册不是中间站点'应该保留。
数据:
df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
.Names = c("id", "time", "site", "stage"),
row.names = c(NA, -12L), class = "data.frame")
df
# id time site stage
# 1 1 1 a 1
# 2 1 2 b 1 <~~ A single middle registration on site 2
# 3 1 3 a 1 However, the individual is in stage 1: -> keep
# 4 2 1 a 1
# 5 2 2 b 2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6 2 3 a 1
# 7 3 1 a 1
# 8 3 2 b 2 <~~ Two middle registrations with stage 2: -> keep both rows
# 9 3 3 b 2 <~~
# 10 3 4 a 1
# 11 4 1 a 1
# 12 4 2 b 2 <~~ A single registration on site 2 with stage 2,
# but it is not in between two sites: -> keep
因此,在测试数据中,只有在id = 2的时间= 2的注册才应该被删除。
以前,我使用plyr::ddply
和rle
的结果来解决问题:
对于每个人,计算站点的运行长度(rle(x$site)
)
如果:
- 在站点之间来回(例如从a到b,然后回到a)
(length(r$values) > 2
)&amp;
- 中间站点只有一个注册(r$lengths[2] == 1
)&amp;
- 中间站点的阶段是2(x$stage[x$site == r$values[2]][1] == 2
)
然后:删除中间网站x[!(x$site == r$values[2]), ]
上的注册
library(plyr)
ddply(df, .(id), function(x){
r <- rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
x[x$site != r$values[2], ]
} else x
})
# id time site stage
# 1 1 1 a 1
# 2 1 2 b 1
# 3 1 3 a 1
# 4 2 1 a 1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5 2 3 a 1 <~~
# 6 3 1 a 1
# 7 3 2 b 2
# 8 3 3 b 2
# 9 3 4 a 1
# 10 4 1 a 1
# 11 4 2 b 2
detach("package:plyr")
现在我在dplyr
中遇到这个问题时遇到了一些麻烦。我在SO(例如this和this)以及github(this和this)上找到了一些相关的帖子,但我无法根据自己的需要调整它们。以下是一些绝望的尝试:
library(dplyr)
df %>%
group_by(id) %>%
do((function(x){
r = rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
filter(x, x$site != r$values[2])
} else x
})(.))
# desired row is not removed
df %>%
group_by(id) %>%
do(function(x){
r = rle(x$site)
if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
x[!(x$site == r$values[2]), ]
} else x
})
# Error: Results are not data frames at positions: 1, 2, 3
这种尝试恰好起作用(与上面的ddply
给出相同的结果),但是远非优雅,我怀疑它的正确方式&#39;:
df %>%
group_by(id) %>%
do(r = rle(.$site)) %>%
do(data.frame(id = .$id,
len = length(.$r$values),
site = .$r$values[2],
len2 = .$r$lengths[2])) %>%
filter(len == 3, len2 == 1) %>%
select(-len) %>%
left_join(df, ., by = c("id", "site")) %>%
filter(!(len2 %in% 1 & stage == 2)) %>%
select(-len2)
如何正确地do
这个? WWHWD?
答案 0 :(得分:3)
我不确定我是否完全理解您的代码背后的逻辑,但这可能是获得相同结果的另一种方式,可能需要进行一些修改:
df %>%
group_by(id) %>%
group_by(grp = cumsum(abs(c(1, diff(as.numeric(site))))), add = TRUE) %>%
filter(!(grp == 2 & n() == 1 & stage == 2))
#Source: local data frame [9 x 5]
#Groups: id, grp
#
# id time site stage grp
#1 1 1 a 1 1
#2 1 2 b 1 2
#3 1 3 a 1 3
#4 2 1 a 1 1 <~~ row in between
#5 2 3 a 1 3 <~~ was removed
#6 3 1 a 1 1
#7 3 2 b 2 2
#8 3 3 b 2 2
#9 3 4 a 1 3
这种方法假定&#34;中间组&#34;永远是第二个&#34; grp&#34;。
创建一个函数可能会更好 - 我会调用intergroup()
,因为它在分组数据中创建了组,并使用它:
intergroup <- function(var, start = 1) {
cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}
df %>%
group_by(id) %>%
group_by(grp = intergroup(site), add = TRUE) %>%
filter(!(grp == 2 & n() == 1 & stage == 2))
OP问题更新后编辑。
针对调整后的问题,请尝试以下调整后的代码:
df %>%
group_by(id) %>%
mutate(z = lag(site, 1) != lead(site, 1)) %>% # check if site before and after are not the same
group_by(grp = intergroup(site), add = TRUE) %>%
filter(!(grp == 2 & n() == 1 & stage == 2 & !is.na(z))) %>% # check for NA in z
ungroup() %>% select(-c(z, grp))
#Source: local data frame [11 x 4]
#
# id time site stage
#1 1 1 a 1
#2 1 2 b 1
#3 1 3 a 1
#4 2 1 a 1
#5 2 3 a 1
#6 3 1 a 1
#7 3 2 b 2
#8 3 3 b 2
#9 3 4 a 1
#10 4 1 a 1
#11 4 2 b 2 <~~ row is kept
答案 1 :(得分:2)
这是rle
替代品,不依赖于do
。该代码的灵感来自@akrun的this answer(在我的问题之后发布;感谢@beginneR的提升)。
df %>%
group_by(id) %>%
mutate(site_idx = with(rle(site),
rep(x = seq_along(lengths), times = lengths))) %>%
filter(!(n_distinct(site_idx) > 2 & sum(site_idx == 2) == 1 &
site_idx == 2 & stage == 2)) %>%
select(-site_idx)
# id time site stage
# 1 1 1 a 1
# 2 1 2 b 1
# 3 1 3 a 1
# 4 2 1 a 1 <~~ the single middle site with stage = 2 at time 2 has been removed
# 5 2 3 a 1 <~~
# 6 3 1 a 1
# 7 3 2 b 2
# 8 3 3 b 2
# 9 3 4 a 1
# 10 4 1 a 1
# 11 4 2 b 2