Question

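I have a data frame, and I want to filter for rows that match some condition, plus the N rows that follow each match. For example, consider a data frame with hour and minutes columns (representing a timestamp for each row). Say I want the first two records starting at hours 0 and 6. Is there a nice way to do this?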

library(dplyr)
set.seed(3)
df <- data.frame(hour = 0:11, minutes = runif(12, 0, 59), count = rpois(12, 3)) %>%
  arrange(hour, minutes)

which produces

> df
   hour   minutes count
1     0  9.914450     3
2     1 47.643468     3
3     2 22.711599     5
4     3 19.336325     5
5     4 35.523940     1
6     5 35.659249     4
7     6  7.353373     5
8     7 17.381455     2
9     8 34.078985     2
10    9 37.227777     0
11   10 30.208938     1
12   11 29.796411     1

A plain filter returns only two rows:

> df %>%
+   filter(hour%%6 == 0)
  hour  minutes count
1    0 9.914450     3
2    6 7.353373     5

However, the answer should be:

  hour   minutes count
1    0  9.914450     3
2    1 47.643468     3
3    6  7.353373     5
4    7 17.381455     2

In this case a modulo operation on the column used for filtering would work, but in the general case that may not be possible.

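The original example is given below; there I wanted the first two records within each hour (the code keeps even hours only). In that case, Akrun's answer works well and exploits the group structure in the data. E.g.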

library(dplyr)
set.seed(0)
df <- data.frame(hour = rep(0:11, 3), minutes = runif(36, 0, 59), count = rpois(36, 3)) %>%
  arrange(hour, minutes)

which looks like:

   hour    minutes count
1     0  7.4077507     2
2     0 10.4168484     3
3     0 52.9051348     4
4     1 15.6650111     4
5     1 15.7660195     5
6     1 40.5343480     4
7     2 21.9553101     1
8     2 22.6621194     4
9     2 22.7807315     2
10    3  0.7900297     3
11    3 33.7983484     4
12    3 45.4206438     3
...

One can do

df %>%
  mutate(is_even_hour = ifelse(hour %% 2 == 0, 1, 0)) %>%
  filter(is_even_hour == 1) %>%
  group_by(hour, is_even_hour) %>%
  filter(row_number() <= 2) %>%
  ungroup %>%
  select(-is_even_hour)

which gives

    hour   minutes count
   <int>     <dbl> <int>
1      0  7.407751     2
2      0 10.416848     3
3      2 21.955310     1
4      2 22.662119     4
5      4 22.560889     2
6      4 29.364255     5
7      6 20.080591     2
8      6 53.004991     3
9      8 35.374384     4
10     8 38.987070     3
11    10  3.645390     4
12    10 10.986838     5

Answer 1

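I can think of this sapply solution using base R.

Basically, the idea is to find the indices of the rows where hour is exactly divisible by 6, and then use seq to generate the following indices to select.

Here you want 2 rows starting at each index, so length.out is 2; if you want more in the future (as mentioned in the comments), you can change it to any number you like.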

y <- which(df$hour%%6 == 0)
df[sapply(y, function(x) seq(x, length.out = 2)), ]

#    hour minutes   count
#1    0  9.914450      3
#2    1  47.643468     3
#7    6  7.353373      5
#8    7  17.381455     2
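
The same idea can be wrapped in a small function. This is only a sketch: the name keep_n_from_match and the toy data frame are made up for illustration and are not part of the original answer. It clips indices that would run past the last row and drops repeated indices when two windows overlap.

```r
# Hypothetical helper (not from the original answer): keep every row where
# `cond` is TRUE plus the rows that follow it, `n` rows per match in total.
keep_n_from_match <- function(data, cond, n = 2) {
  starts <- which(cond)  # row indices of the matching rows
  idx <- unlist(lapply(starts, function(s) seq(s, length.out = n)))
  idx <- unique(idx[idx <= nrow(data)])  # drop out-of-range and repeated indices
  data[idx, , drop = FALSE]
}

# Toy data with the same hour column as in the question
toy <- data.frame(hour = 0:11, count = 12:1)

res <- keep_n_from_match(toy, toy$hour %% 6 == 0, n = 2)
res$hour

# A match on the last row simply yields that one row
last <- keep_n_from_match(toy, toy$hour == 11, n = 3)
last$hour
```

Because the condition is passed in as a logical vector, it does not have to be expressible as a modulo operation.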

Answer 2

After grouping by 'hour', we can do this in a single filter step

df %>%
     group_by(hour) %>%
     filter(!hour%%2 & row_number() <3)
#     hour   minutes count
#    <int>     <dbl> <int>
#1      0  7.407751     2
#2      0 10.416848     3
#3      2 21.955310     1
#4      2 22.662119     4
#5      4 22.560889     2
#6      4 29.364255     5
#7      6 20.080591     2
#8      6 53.004991     3
#9      8 35.374384     4
#10     8 38.987070     3
#11    10  3.645390     4
#12    10 10.986838     5

For the updated post

i1 <- df %>%
          filter(hour %% 6 == 0) %>%
          .$hour %>%
          {rep(., each = 2) + 0:1} %>%  # braces needed: %>% binds tighter than +
          match(df$hour)                # looks up hour and hour + 1 by value
df[i1,]
#   hour   minutes count
#1    0  9.914450     3
#2    1 47.643468     3
#7    6  7.353373     5
#8    7 17.381455     2
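
A side note on mixing %>% with arithmetic, as a small sketch with made-up vectors (x and rev() are only illustrative): in R, %>% binds more tightly than +, so arithmetic placed in the middle of a pipe is not applied to the piped value unless it is wrapped in braces.

```r
library(magrittr)

x <- c(10, 20)

# Parsed as (x %>% rep(each = 2)) + (0:1 %>% rev()),
# i.e. c(10, 10, 20, 20) + c(1, 0)
a <- x %>% rep(each = 2) + 0:1 %>% rev()

# Braces make the addition part of the pipeline:
# rev(c(10, 11, 20, 21))
b <- x %>% {rep(., each = 2) + 0:1} %>% rev()

a
b
```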

Or this can be done in a compact way with data.table

library(data.table)
setDT(df)[df[, rep(which(!hour%%6), each = 2) + 0:1 ]]
#   hour   minutes count
#1:    0  9.914450     3
#2:    1 47.643468     3
#3:    6  7.353373     5
#4:    7 17.381455     2

Answer 3

A possible simple solution (implemented in base R, dplyr & data.table):

# with base R:
df[which(df$hour %% 6 < 2),]

# with dplyr:
df %>% filter(hour %% 6 < 2)

# with data.table:
setDT(df)[which(df$hour %% 6 < 2)]
# or with .I instead of 'which':
setDT(df)[df[,.I[hour %% 6 < 2]]]

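As @Alex rightly pointed out, the solutions above will not give the correct output when there is no hour 7. You can adapt the code using the rep and + 0:1 approach shown by @akrun: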

# with base R:
df[rep(which(df$hour %% 6 == 0), each = 2) + 0:1,]

# with dplyr (works also with 'filter' instead of 'slice'):
df %>% slice(rep(which(hour %% 6 == 0), each = 2) + 0:1)

# with data.table
setDT(df)[df[, rep(.I[hour %% 6 == 0], each = 2) + 0:1]]
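
As a further variation (a sketch, not taken from any of the answers), the window can also be expressed without index arithmetic by numbering the stretches of rows that start at each match with cumsum and keeping the first n rows of each stretch. The data frame df2 and the column name window_id are made up for illustration; df2 has no hour 7, which is the case where the modulo shortcut falls short.

```r
library(dplyr)

# Made-up data with no hour 7, so the row after hour 6 is hour 8
df2 <- data.frame(hour = c(0:6, 8:11), count = 11:1)

n <- 2  # rows to keep per match, including the matching row

res <- df2 %>%
  mutate(window_id = cumsum(hour %% 6 == 0)) %>%  # increases by 1 at every match
  filter(window_id > 0) %>%                      # drop rows before the first match
  group_by(window_id) %>%
  filter(row_number() <= n) %>%                  # first n rows of each window
  ungroup() %>%
  select(-window_id)

res$hour

# For comparison, the modulo shortcut misses the row after hour 6
shortcut <- df2 %>% filter(hour %% 6 < 2)
shortcut$hour
```

Each match starts a new window here, so if two matches are closer than n rows apart the second one resets the count rather than producing duplicate rows.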

dplyr filter with window: find rows that match, and keep the subsequent N rows
