R data frame: select rows by sorting on one column and filtering on the values in two columns

Time: 2018-10-21 00:34:22

Tags: r dataframe dplyr

I have a data frame df that looks like this:

user_id     rating      date          status
10506       4           2008-11-11    2
10506       3           2008-11-13    1
10506       4           2008-11-23    3
10506       2           2008-11-29    4
10506       1           2009-01-15    3
10506       1           2009-11-11    2
10507       3           2007-10-20    1
10507       5           2007-11-11    1
10507       2           2007-12-21    2
10507       5           2008-01-08    3
10507       4           2008-01-31    3
10507       3           2008-02-05    4
10507       4           2008-03-10    2

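I want to do the following two things:

  1. Select the three rows with the earliest date for each user_id. I know that every user_id has at least three observations. date is not stored as a Date type, but sorting by date still puts the rows in chronological order.

  2. Select three rows for each user_id: the ones with the earliest date among the rows where status is 3 or 4.

Is there a dplyr solution where I can group by user_id, sort by date in ascending order, and then take the first three rows? Any help is appreciated.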

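Edit:

I fixed a typo in the dummy data I gave in the question; apologies for the mistake. I have also added the expected output to make things clear:

Output for Part 1: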

user_id     rating      date          status
10506       4           2008-11-11    2
10506       3           2008-11-13    1
10506       4           2008-11-23    3
10507       3           2007-10-20    1
10507       5           2007-11-11    1
10507       2           2007-12-21    2

Output for Part 2:

user_id     rating      date          status
10506       4           2008-11-23    3
10506       2           2008-11-29    4
10506       1           2009-01-15    3
10507       5           2008-01-08    3
10507       4           2008-01-31    3
10507       3           2008-02-05    4

2 answers:

Answer 0 (score: 1)

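  1. You already know how to use group_by(user_id) and arrange(date).
    • I think that in your pipeline you can run filter(status == 3 | status == 4) first,
    • to keep only the rows whose status is 3 or 4.
  2. Now you have rows that are
    1. grouped by user_id,
    2. sorted by date,
    3. and restricted to status 3 or 4,
    4. so all that is left is slice(1:3): the first three rows of each group.

Chain these steps with %>% and you get the result easily.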

library(tidyverse)

df <-
  tribble(
    ~user_id, ~rating, ~date, ~status,
    10506, 4, "2008-11-11", 2,
    10506, 3, "2008-11-13", 1,
    10506, 4, "2008-11-23", 3,
    10506, 2, "2008-11-29", 4,
    10506, 1, "2009-01-15", 3,
    10506, 1, "2009-11-11", 2,
    10507, 3, "2007-10-20", 1,
    10507, 5, "2007-11-11", 1,
    10507, 2, "2007-12-21", 2,
    10507, 5, "2008-01-08", 3,
    10507, 4, "2008-01-31", 3,
    10507, 3, "2008-02-05", 4,
    10507, 4, "2008-03-10", 2
  )

# dplyr solution
df %>%
  filter(status == 3 | status == 4) %>%
  group_by(user_id) %>%
  arrange(date) %>%
  slice(1:3)

#> # A tibble: 6 x 4
#> # Groups:   user_id [2]
#>   user_id rating date       status
#>     <dbl>  <dbl> <chr>       <dbl>
#> 1   10506      4 2008-11-23      3
#> 2   10506      2 2008-11-29      4
#> 3   10506      1 2009-01-15      3
#> 4   10507      5 2008-01-08      3
#> 5   10507      4 2008-01-31      3
#> 6   10507      3 2008-02-05      4
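
The code above covers only Part 2. As a minimal sketch, Part 1 can follow the same steps with the filter dropped. The name part1 is introduced here for illustration, and the sketch assumes, as the question states, that the YYYY-MM-DD strings sort chronologically as plain text.

```r
library(dplyr)

# Same sample data as above, rebuilt so this block runs on its own.
df <- tribble(
  ~user_id, ~rating, ~date, ~status,
  10506, 4, "2008-11-11", 2,
  10506, 3, "2008-11-13", 1,
  10506, 4, "2008-11-23", 3,
  10506, 2, "2008-11-29", 4,
  10506, 1, "2009-01-15", 3,
  10506, 1, "2009-11-11", 2,
  10507, 3, "2007-10-20", 1,
  10507, 5, "2007-11-11", 1,
  10507, 2, "2007-12-21", 2,
  10507, 5, "2008-01-08", 3,
  10507, 4, "2008-01-31", 3,
  10507, 3, "2008-02-05", 4,
  10507, 4, "2008-03-10", 2
)

# Part 1: the three earliest rows per user, whatever their status.
part1 <- df %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%  # sort by date within each user
  slice(1:3) %>%                       # keep the first three rows per group
  ungroup()

part1
```

Passing .by_group = TRUE makes the per-user ordering explicit; ungroup() at the end is optional and only prevents later steps from silently operating per group.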

Answer 1 (score: 0)

This should do the trick...

library(dplyr)
df <- tribble(
~user_id, ~rating,  ~date,  ~status,
10506, 4, "2008-11-11",    2,
10506, 3, "2008-11-13",    1,
10506, 4, "2008-11-23",    3,
10506, 2, "2008-11-29",    4,
10506, 1, "2009-01-15",    3,
10506, 1, "2009-11-11",    2,
10507, 3, "2007-10-20",    1,
10507, 5, "2007-11-11",    1,
10507, 2, "2007-12-21",    2,
10507, 5, "2008-01-08",    3,
10507, 4, "2008-01-31",    3,
10507, 3, "2008-02-05",    4,
10507, 4, "2008-03-10",    2
)

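# Part 1: the three earliest rows per user_id
Part1 <- df %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%  # sort by date within each user
  mutate(seq = row_number()) %>%       # position of the row within its group
  filter(seq <= 3) %>%
  select(-seq)

# Part 2: the same, restricted to rows with status 3 or 4
Part2 <- df %>%
  filter(status %in% c(3, 4)) %>%
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(seq = row_number()) %>%
  filter(seq <= 3) %>%
  select(-seq)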
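
Both answers sort date as text, which works here only because the strings are in YYYY-MM-DD form. The following is a hedged variant of Part 2, not taken from either answer: it parses the column with as.Date() first and replaces the row_number() bookkeeping with slice_head(), which assumes dplyr 1.0.0 or later. The name part2_alt is hypothetical.

```r
library(dplyr)

# Same sample data as above, rebuilt so this block runs on its own.
df <- tribble(
  ~user_id, ~rating, ~date, ~status,
  10506, 4, "2008-11-11", 2,
  10506, 3, "2008-11-13", 1,
  10506, 4, "2008-11-23", 3,
  10506, 2, "2008-11-29", 4,
  10506, 1, "2009-01-15", 3,
  10506, 1, "2009-11-11", 2,
  10507, 3, "2007-10-20", 1,
  10507, 5, "2007-11-11", 1,
  10507, 2, "2007-12-21", 2,
  10507, 5, "2008-01-08", 3,
  10507, 4, "2008-01-31", 3,
  10507, 3, "2008-02-05", 4,
  10507, 4, "2008-03-10", 2
)

part2_alt <- df %>%
  mutate(date = as.Date(date)) %>%      # parse "YYYY-MM-DD" strings as Date
  filter(status %in% c(3, 4)) %>%       # keep status 3 or 4 only
  group_by(user_id) %>%
  arrange(date, .by_group = TRUE) %>%   # chronological order within each user
  slice_head(n = 3) %>%                 # assumption: dplyr >= 1.0.0
  ungroup()

part2_alt
```

On older dplyr versions, slice(1:3) in place of slice_head(n = 3) should behave the same for this data.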