I have a data frame df, shown below:
user_id rating date status
10506 4 2008-11-11 2
10506 3 2008-11-13 1
10506 4 2008-11-23 3
10506 2 2008-11-29 4
10506 1 2009-01-15 3
10506 1 2009-11-11 2
10507 3 2007-10-20 1
10507 5 2007-11-11 1
10507 2 2007-12-21 2
10507 5 2008-01-08 3
10507 4 2008-01-31 3
10507 3 2008-02-05 4
10507 4 2008-03-10 2
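I want to do the following two things:
1. For each user_id, select the three rows with the earliest date. I know that every user_id has at least three observations. date is not stored as a date type, but sorting by date does put the rows in chronological order.
2. For each user_id, select the three rows with the earliest date among those where status is 3 or 4.
Is there a dplyr solution in which I group by user_id and then select the first three rows after sorting date in ascending order? Any help is appreciated.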
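Edit:
I have corrected a typo in the dummy data given in the question; apologies for the error. I have also included the expected output to make things clear:
Output for part 1: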
user_id rating date status
10506 4 2008-11-11 2
10506 3 2008-11-13 1
10506 4 2008-11-23 3
10507 3 2007-10-20 1
10507 5 2007-11-11 1
10507 2 2007-12-21 2
Output for part 2:
user_id rating date status
10506 4 2008-11-23 3
10506 2 2008-11-29 4
10506 1 2009-01-15 3
10507 5 2008-01-08 3
10507 4 2008-01-31 3
10507 3 2008-02-05 4
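Answer 0: (score: 1)
The steps for part 2 are:
filter(status == 3 | status == 4): keep only the rows where status is 3 or 4
group_by(user_id) and arrange(date): group by user_id and sort by date
slice(1:3): take the first three rows of each group
Chaining these in order with %>% gives the result easily.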
library(tidyverse)
df <-
  tribble(
    ~user_id, ~rating, ~date, ~status,
    10506, 4, "2008-11-11", 2,
    10506, 3, "2008-11-13", 1,
    10506, 4, "2008-11-23", 3,
    10506, 2, "2008-11-29", 4,
    10506, 1, "2009-01-15", 3,
    10506, 1, "2009-11-11", 2,
    10507, 3, "2007-10-20", 1,
    10507, 5, "2007-11-11", 1,
    10507, 2, "2007-12-21", 2,
    10507, 5, "2008-01-08", 3,
    10507, 4, "2008-01-31", 3,
    10507, 3, "2008-02-05", 4,
    10507, 4, "2008-03-10", 2
  )
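# dplyr solution (part 2: earliest three rows per user_id with status 3 or 4)
df %>%
  filter(status == 3 | status == 4) %>%  # keep rows where status is 3 or 4
  group_by(user_id) %>%
  arrange(date) %>%                      # "YYYY-MM-DD" strings sort chronologically
  slice(1:3)                             # first three rows of each group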
#> # A tibble: 6 x 4
#> # Groups: user_id [2]
#> user_id rating date status
#> <dbl> <dbl> <chr> <dbl>
#> 1 10506 4 2008-11-23 3
#> 2 10506 2 2008-11-29 4
#> 3 10506 1 2009-01-15 3
#> 4 10507 5 2008-01-08 3
#> 5 10507 4 2008-01-31 3
#> 6 10507 3 2008-02-05 4
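The pipeline above covers part 2 only. As a minimal sketch, part 1 can presumably be handled the same way by dropping the filter step. The name part1 is chosen here for illustration, and .by_group = TRUE is added so that the sort happens within each user_id rather than over the whole data frame.

```r
library(dplyr)

# Same dummy data as in the question
df <- tribble(
  ~user_id, ~rating, ~date, ~status,
  10506, 4, "2008-11-11", 2,
  10506, 3, "2008-11-13", 1,
  10506, 4, "2008-11-23", 3,
  10506, 2, "2008-11-29", 4,
  10506, 1, "2009-01-15", 3,
  10506, 1, "2009-11-11", 2,
  10507, 3, "2007-10-20", 1,
  10507, 5, "2007-11-11", 1,
  10507, 2, "2007-12-21", 2,
  10507, 5, "2008-01-08", 3,
  10507, 4, "2008-01-31", 3,
  10507, 3, "2008-02-05", 4,
  10507, 4, "2008-03-10", 2
)

# Part 1 (illustrative name): earliest three rows per user_id, no status filter
part1 <- df %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%  # sort by date within each user_id
  slice(1:3) %>%                       # first three rows of each group
  ungroup()

part1
```

With the dummy data this should reproduce the six rows listed under "Output for part 1" in the question.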
Answer 1: (score: 0)
This should do the trick...
library(dplyr)
df <- tribble(
  ~user_id, ~rating, ~date, ~status,
  10506, 4, "2008-11-11", 2,
  10506, 3, "2008-11-13", 1,
  10506, 4, "2008-11-23", 3,
  10506, 2, "2008-11-29", 4,
  10506, 1, "2009-01-15", 3,
  10506, 1, "2009-11-11", 2,
  10507, 3, "2007-10-20", 1,
  10507, 5, "2007-11-11", 1,
  10507, 2, "2007-12-21", 2,
  10507, 5, "2008-01-08", 3,
  10507, 4, "2008-01-31", 3,
  10507, 3, "2008-02-05", 4,
  10507, 4, "2008-03-10", 2
)
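# Part 1: earliest three rows for each user_id
Part1 <- df %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%  # sort by date within each user_id
  mutate(seq = row_number()) %>%       # running row index inside each group
  filter(seq <= 3) %>%                 # keep the first three rows per group
  select(-seq)                         # drop the helper column
# Part 2: same as part 1, restricted to rows where status is 3 or 4
Part2 <- df %>%
  filter(status %in% c(3, 4)) %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(seq = row_number()) %>%
  filter(seq <= 3) %>%
  select(-seq)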
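As a possible variation on the code above, the helper column can be avoided with slice_head(), which is available in dplyr 1.0.0 and later. This is only a sketch: the function name first_three is hypothetical, and converting date with as.Date() is an extra step not in the original answers, added on the assumption that every value is a valid "YYYY-MM-DD" string.

```r
library(dplyr)

# Same dummy data as in the question
df <- tribble(
  ~user_id, ~rating, ~date, ~status,
  10506, 4, "2008-11-11", 2,
  10506, 3, "2008-11-13", 1,
  10506, 4, "2008-11-23", 3,
  10506, 2, "2008-11-29", 4,
  10506, 1, "2009-01-15", 3,
  10506, 1, "2009-11-11", 2,
  10507, 3, "2007-10-20", 1,
  10507, 5, "2007-11-11", 1,
  10507, 2, "2007-12-21", 2,
  10507, 5, "2008-01-08", 3,
  10507, 4, "2008-01-31", 3,
  10507, 3, "2008-02-05", 4,
  10507, 4, "2008-03-10", 2
)

# Hypothetical helper: earliest three rows per user_id
first_three <- function(data) {
  data %>%
    mutate(date = as.Date(date)) %>%     # assumes valid "YYYY-MM-DD" strings
    group_by(user_id) %>%
    arrange(date, .by_group = TRUE) %>%  # sort by date within each user_id
    slice_head(n = 3) %>%                # first three rows of each group
    ungroup()
}

part1 <- first_three(df)                               # part 1: all rows
part2 <- first_three(filter(df, status %in% c(3, 4)))  # part 2: status 3 or 4 only
```

Sorting real Date values rather than strings does not change the result for this data, but it would guard against formats that do not sort chronologically as text.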