Label observations by ranking on multiple columns, accounting for ties

Time: 2018-08-25 16:18:19

Tags: r dplyr reshape tidyverse tidyr

I have a data frame that looks like this:

# A tibble: 15 x 5
   group name    sum count max_elements
   <int> <fct> <int> <int>        <int>
 1     1 aaa       3     2            4
 2     1 bbb       3     1            4
 3     1 ccc       2     2            4
 4     1 ddd       2     2            4
 5     1 eee       1     0            4
 6     2 aaa       3     2            3
 7     2 bbb       3     1            3
 8     2 ccc       2     3            3
 9     2 ddd       2     1            3
10     3 aaa       3     4            4
11     3 bbb       3     2            4
12     3 ccc       2     5            4
13     3 ddd       2     1            4
14     3 eee       2     1            4
15     3 fff       2     1            4

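I would like to label each observation according to the following decision logic:

  • Within each group, sort all names by sum first, then by count
  • For each group, take the max_elements value into account
  • Create a label for each name as follows:

    • selected, if the name has a high sum and a high count within the threshold of max_elements.
    • pick_random, if several names have the same sum and the same count within the threshold of max_elements.
    • not_selected, if it falls outside the "race".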
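
The pick_random rule depends on knowing how many names share the same (sum, count) pair. A minimal sketch of that tie detection on the group 3 rows from the data above, assuming dplyr is available; the names g3, g3_ties and n_tied are illustrative only and do not come from the question:

```r
library(dplyr)

# Group 3 rows from the question's data (only the columns needed here)
g3 <- tibble(
  name  = c("aaa", "bbb", "ccc", "ddd", "eee", "fff"),
  sum   = c(3L, 3L, 2L, 2L, 2L, 2L),
  count = c(4L, 2L, 5L, 1L, 1L, 1L)
)

# n_tied = number of rows sharing the same (sum, count) pair;
# values above 1 mark the candidates for pick_random
g3_ties <- g3 %>%
  add_count(sum, count, name = "n_tied")

g3_ties
```

Here ddd, eee and fff share the pair (2, 1), so they are the rows that would need a tie label.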

For example, for group 1 the result would be:

# A tibble: 5 x 6
  group name  decision      sum count max_elements
  <int> <fct> <fct>       <int> <int>        <int>
1     1 aaa   selected        3     2            4
2     1 bbb   selected        3     1            4
3     1 ccc   pick_random     2     2            4
4     1 ddd   pick_random     2     2            4
5     1 eee   selected        1     0            4

For group 2 there is no random pick, because no names are tied within the maximum size.

# A tibble: 4 x 6
  group name  decision       sum count max_elements
  <int> <fct> <fct>        <int> <int>        <int>
1     2 aaa   selected         3     2            3
2     2 bbb   selected         3     1            3
3     2 ccc   selected         2     3            3
4     2 ddd   not_selected     2     1            3

For group 3, by contrast:

# A tibble: 6 x 6
  group name  decision          sum count max_elements
  <int> <fct> <fct>           <int> <int>        <int>
1     3 aaa   selected            3     4            4
2     3 bbb   selected            3     2            4
3     3 ccc   selected            2     5            4
4     3 ddd   pick_random         2     1            4
5     3 eee   pick_random         2     1            4
6     3 fff   pick_random         2     1            4

The final output df would be:

# A tibble: 15 x 6
   group name  decision          sum count max_elements
   <int> <fct> <fct>           <int> <int>        <int>
 1     1 aaa   selected            3     2            4
 2     1 bbb   selected            3     1            4
 3     1 ccc   pick_random         2     2            4
 4     1 ddd   pick_random         2     2            4
 5     1 eee   selected            1     0            4
 6     2 aaa   selected            3     2            3
 7     2 bbb   selected            3     1            3
 8     2 ccc   selected            2     3            3
 9     2 ddd   not_selected        2     1            3
10     3 aaa   selected            3     4            4
11     3 bbb   selected            3     2            4
12     3 ccc   selected            2     5            4
13     3 ddd   pick_random         2     1            4
14     3 eee   pick_random         2     1            4
15     3 fff   pick_random         2     1            4

Reproducible df:

structure(list(group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L), name = structure(c(1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("aaa", "bbb", 
"ccc", "ddd", "eee", "fff"), class = "factor"), sum = c(3L, 3L, 
2L, 2L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), count = c(2L, 
1L, 2L, 2L, 0L, 2L, 1L, 3L, 1L, 4L, 2L, 5L, 1L, 1L, 1L), max_elements = c(4L, 
4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L)), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))

So far I have tried arrange and top_n, but I don't know how to label the cases where several observations have the same sum and the same count.

df %>%
  group_by(group) %>%
  arrange(-sum, -count) %>%
  top_n(as.integer(max_elements))

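1 answer:

Answer 0 (score: 0)

Here is my attempt at solving your problem. We can use rleid from the data.table package to create a Rank column. After that, we can use case_when to assign labels based on conditions. Note that rleid numbers consecutive runs, so it is important to sort the rows in the right order before applying it. It looks like you have already done that. If not, add arrange(group, name, sum, count) as the first pipe operation.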
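
To make the role of rleid concrete, here is a small sketch on the sum and count values of group 1, in the order they appear in the question. The vector names sum_g1, count_g1 and run_id are illustrative only:

```r
library(data.table)

# sum and count for group 1 (aaa, bbb, ccc, ddd, eee), in the question's order
sum_g1   <- c(3L, 3L, 2L, 2L, 1L)
count_g1 <- c(2L, 1L, 2L, 2L, 0L)

# rleid() starts a new id whenever any of its arguments changes from one
# row to the next, so consecutive identical (sum, count) pairs share an id
run_id <- rleid(sum_g1, count_g1)
run_id
```

Because ccc and ddd both have the pair (2, 2) and sit next to each other, they receive the same id, which the pipeline below compares against max_elements.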

library(tidyverse)
library(data.table)

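# `dat` is the data frame from the question
dat2 <- dat %>%
  # Count how many rows share the same (sum, count) pair within each group
  group_by(group, sum, count) %>%
  add_count() %>%
  group_by(group) %>%
  # Number the runs of identical (sum, count) pairs; relies on the row order
  mutate(Rank = rleid(sum, count)) %>%
  mutate(decision = case_when(
    n > 1 & Rank <= max_elements   ~ "pick_random",
    Rank > max_elements            ~ "not_selected",
    TRUE                           ~ "selected"
  )) %>%
  ungroup() %>%
  select(group, name, decision, sum, count, max_elements) %>%
  mutate(decision = factor(decision))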
dat2
# # A tibble: 15 x 6
#    group name  decision       sum count max_elements
#    <int> <fct> <fct>        <int> <int>        <int>
#  1     1 aaa   selected         3     2            4
#  2     1 bbb   selected         3     1            4
#  3     1 ccc   pick_random      2     2            4
#  4     1 ddd   pick_random      2     2            4
#  5     1 eee   selected         1     0            4
#  6     2 aaa   selected         3     2            3
#  7     2 bbb   selected         3     1            3
#  8     2 ccc   selected         2     3            3
#  9     2 ddd   not_selected     2     1            3
# 10     3 aaa   selected         3     4            4
# 11     3 bbb   selected         3     2            4
# 12     3 ccc   selected         2     5            4
# 13     3 ddd   pick_random      2     1            4
# 14     3 eee   pick_random      2     1            4
# 15     3 fff   pick_random      2     1            4
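
As a variation, the same labelling can be sketched with dplyr alone. This is not the answer's code: it assumes that sorting by descending sum and then descending count is the intended order, and it replaces rleid with a cumulative sum over a "new run" flag. The names labelled, n_tied and new_run are illustrative only.

```r
library(dplyr)

# The question's data, rebuilt by hand (name kept as character for brevity)
dat <- tibble(
  group = rep(1:3, times = c(5, 4, 6)),
  name  = c("aaa", "bbb", "ccc", "ddd", "eee",
            "aaa", "bbb", "ccc", "ddd",
            "aaa", "bbb", "ccc", "ddd", "eee", "fff"),
  sum   = c(3L, 3L, 2L, 2L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L),
  count = c(2L, 1L, 2L, 2L, 0L, 2L, 1L, 3L, 1L, 4L, 2L, 5L, 1L, 1L, 1L),
  max_elements = rep(c(4L, 3L, 4L), times = c(5, 4, 6))
)

labelled <- dat %>%
  # Make the ordering explicit instead of relying on the input order
  arrange(group, desc(sum), desc(count)) %>%
  # Number of rows sharing the same (sum, count) pair within a group
  add_count(group, sum, count, name = "n_tied") %>%
  group_by(group) %>%
  mutate(
    # TRUE on the first row of a group and whenever (sum, count) changes
    new_run = row_number() == 1 | sum != lag(sum) | count != lag(count),
    # Running count of runs, equivalent to rleid(sum, count) on sorted rows
    Rank = cumsum(new_run),
    decision = case_when(
      n_tied > 1 & Rank <= max_elements ~ "pick_random",
      Rank > max_elements               ~ "not_selected",
      TRUE                              ~ "selected"
    )
  ) %>%
  ungroup() %>%
  select(group, name, decision, sum, count, max_elements)

labelled
```

Note that Rank here counts distinct (sum, count) pairs rather than individual rows, which is also how the rleid version behaves; whether that matches the intended meaning of max_elements in every case is an assumption carried over from the answer.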