使用dplyr :: do的匿名函数 - 使用rle的结果来过滤数据

时间:2014-11-28 08:51:34

标签: r dplyr

我有按主题(' id')分组的时间序列数据,这些数据会保留在特定的网站上。并有一定的阶段'在每个'时间'步骤

有时,受试者从一个站点切换到另一个站点,并可能再次回到原点。如果个人将网站来回切换(例如,从网站'到网站' b',然后返回网站' a' )如果中间网站只有一个注册(在转换aba中,那么网站' b'将被视为' ;中间网站')个人处于特定阶段(此处,阶段= 2)位于中间网站,然后我希望在此时间删除注册。

我的虚拟数据由四个主题组成。其中三个(主题1-3)已从站点a移至b,然后返回站点b,其中一个已从a移至b。

前两个科目都在中间站点进行了一次注册。主题1位于中间站点的第1阶段,我希望保留该注册。另一方面,对象2位于中间站点的第2阶段,应该删除该注册。主题3也在a和b之间来回移动。但是,虽然它位于中间站点b的第2阶段,但它有两个注册,并且两个注册都保留。主题4已从站点a移至b,但不再返回。因此,尽管它位于站点b的第2阶段,但站点b上的注册不是中间站点'应该保留。

数据:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                     time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
                     site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
                     stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
                     .Names = c("id", "time", "site", "stage"),
                row.names = c(NA, -12L), class = "data.frame")

df

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1 <~~ A single middle registration on site 2
# 3   1    3    a     1     However, the individual is in stage 1: -> keep 

# 4   2    1    a     1
# 5   2    2    b     2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6   2    3    a     1

# 7   3    1    a     1
# 8   3    2    b     2 <~~ Two middle registrations with stage 2: -> keep both rows 
# 9   3    3    b     2 <~~
# 10  3    4    a     1

# 11  4    1    a     1 
# 12  4    2    b     2 <~~ A single registration on site 2 with stage 2,
#                            but it is not in between two sites: -> keep

因此,在测试数据中,只有在id = 2的时间= 2的注册才应该被删除。

以前,我使用plyr::ddplyrle的结果来解决问题:

对于每个人,计算站点的运行长度(rle(x$site)
如果
  - 在站点之间来回(例如从a到b,然后回到a)    (length(r$values) > 2&amp;
  - 中间站点只有一个注册(r$lengths[2] == 1&amp;
  - 中间站点的阶段是2(x$stage[x$site == r$values[2]][1] == 2
然后:删除中间网站x[!(x$site == r$values[2]), ]上的注册

library(plyr)

ddply(df, .(id), function(x){
  r <- rle(x$site)
  if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
    x[x$site != r$values[2], ]
  } else x
})

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1

# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5   2    3    a     1 <~~

# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1

# 10  4    1    a     1
# 11  4    2    b     2

detach("package:plyr")

现在我在dplyr中遇到这个问题时遇到了一些麻烦。我在SO(例如thisthis)以及github(thisthis)上找到了一些相关的帖子,但我无法根据自己的需要调整它们。以下是一些绝望的尝试:

library(dplyr)

df %>%
  group_by(id) %>%
  do((function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
    filter(x, x$site != r$values[2])
  } else x
})(.))
# desired row is not removed

df %>%
  group_by(id) %>%
  do(function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      x[!(x$site == r$values[2]), ]
    } else x
  })
# Error: Results are not data frames at positions: 1, 2, 3

这种尝试恰好起作用(与上面的ddply给出相同的结果),但是远非优雅,我怀疑它的正确方式&#39;:

df %>%
  group_by(id) %>%
  do(r = rle(.$site)) %>%  
  do(data.frame(id = .$id,
                len = length(.$r$values),
                site = .$r$values[2],
                len2 = .$r$lengths[2])) %>%
  filter(len == 3, len2 == 1) %>%
  select(-len) %>%
  left_join(df, ., by = c("id", "site")) %>%
  filter(!(len2 %in% 1 & stage == 2)) %>%
  select(-len2)

如何正确地do这个? WWHWD?

2 个答案:

答案 0 :(得分:3)

我不确定我是否完全理解您的代码背后的逻辑,但这可能是获得相同结果的另一种方式,可能需要进行一些修改:

df %>% 
  group_by(id) %>%
  group_by(grp = cumsum(abs(c(1, diff(as.numeric(site))))), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

#Source: local data frame [9 x 5]
#Groups: id, grp
#
#  id time site stage grp
#1  1    1    a     1   1
#2  1    2    b     1   2
#3  1    3    a     1   3
#4  2    1    a     1   1     <~~ row in between 
#5  2    3    a     1   3     <~~ was removed
#6  3    1    a     1   1
#7  3    2    b     2   2
#8  3    3    b     2   2
#9  3    4    a     1   3

这种方法假定&#34;中间组&#34;永远是第二个&#34; grp&#34;。


创建一个函数可能会更好 - 我会调用intergroup(),因为它在分组数据中创建了组,并使用它:

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

df %>% 
  group_by(id) %>%
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

OP问题更新后编辑。

针对调整后的问题,请尝试以下调整后的代码:

df %>% 
  group_by(id) %>%
  mutate(z = lag(site, 1) != lead(site, 1)) %>%   # check if site before and after are not the same
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2 & !is.na(z))) %>%  # check for NA in z
  ungroup() %>% select(-c(z, grp))  

#Source: local data frame [11 x 4]
#
#   id time site stage
#1   1    1    a     1
#2   1    2    b     1
#3   1    3    a     1
#4   2    1    a     1
#5   2    3    a     1
#6   3    1    a     1
#7   3    2    b     2
#8   3    3    b     2
#9   3    4    a     1
#10  4    1    a     1
#11  4    2    b     2    <~~ row is kept

答案 1 :(得分:2)

这是rle替代品,不依赖于do。该代码的灵感来自@akrun的this answer(在我的问题之后发布;感谢@beginneR的提升)。

df %>%
  group_by(id) %>%
  mutate(site_idx = with(rle(site),
                           rep(x = seq_along(lengths), times = lengths))) %>%
  filter(!(n_distinct(site_idx) > 2 & sum(site_idx == 2) == 1 &
           site_idx == 2 & stage == 2)) %>%
  select(-site_idx)

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1
# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 has been removed
# 5   2    3    a     1 <~~
# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1
# 10  4    1    a     1
# 11  4    2    b     2