I have a data frame that looks like this:
# A tibble: 15 x 5
   group name    sum count max_elements
   <int> <fct> <int> <int>        <int>
 1     1 aaa       3     2            4
 2     1 bbb       3     1            4
 3     1 ccc       2     2            4
 4     1 ddd       2     2            4
 5     1 eee       1     0            4
 6     2 aaa       3     2            3
 7     2 bbb       3     1            3
 8     2 ccc       2     3            3
 9     2 ddd       2     1            3
10     3 aaa       3     4            4
11     3 bbb       3     2            4
12     3 ccc       2     5            4
13     3 ddd       2     1            4
14     3 eee       2     1            4
15     3 fff       2     1            4
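I want to label each observation according to the following decision rule.
Create a label for each name as follows:
- selected, if the name has a higher sum and a higher count and falls within the threshold of the top max_elements elements.
- pick_random, if several names have the same sum and the same count within the threshold of the top max_elements elements.
- not_selected, if the name falls outside the "competition".
For example, for group 1 the result would be: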
# A tibble: 5 x 6
  group name  decision      sum count max_elements
  <int> <fct> <fct>       <int> <int>        <int>
1     1 aaa   selected        3     2            4
2     1 bbb   selected        3     1            4
3     1 ccc   pick_random     2     2            4
4     1 ddd   pick_random     2     2            4
5     1 eee   selected        1     0            4
For group 2 there is no pick_random, because no names have the same score within the maximum size:
# A tibble: 4 x 6
  group name  decision       sum count max_elements
  <int> <fct> <fct>        <int> <int>        <int>
1     2 aaa   selected         3     2            3
2     2 bbb   selected         3     1            3
3     2 ccc   selected         2     3            3
4     2 ddd   not_selected     2     1            3
For group 3, instead:
# A tibble: 6 x 6
  group name  decision      sum count max_elements
  <int> <fct> <fct>       <int> <int>        <int>
1     3 aaa   selected        3     4            4
2     3 bbb   selected        3     2            4
3     3 ccc   selected        2     5            4
4     3 ddd   pick_random     2     1            4
5     3 eee   pick_random     2     1            4
6     3 fff   pick_random     2     1            4
The final output df would look like this:
# A tibble: 15 x 6
   group name  decision       sum count max_elements
   <int> <fct> <fct>        <int> <int>        <int>
 1     1 aaa   selected         3     2            4
 2     1 bbb   selected         3     1            4
 3     1 ccc   pick_random      2     2            4
 4     1 ddd   pick_random      2     2            4
 5     1 eee   selected         1     0            4
 6     2 aaa   selected         3     2            3
 7     2 bbb   selected         3     1            3
 8     2 ccc   selected         2     3            3
 9     2 ddd   not_selected     2     1            3
10     3 aaa   selected         3     4            4
11     3 bbb   selected         3     2            4
12     3 ccc   selected         2     5            4
13     3 ddd   pick_random      2     1            4
14     3 eee   pick_random      2     1            4
15     3 fff   pick_random      2     1            4
Reproducible df:
structure(list(group = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L), name = structure(c(1L, 2L, 3L, 4L, 5L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("aaa", "bbb",
"ccc", "ddd", "eee", "fff"), class = "factor"), sum = c(3L, 3L,
2L, 2L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), count = c(2L,
1L, 2L, 2L, 0L, 2L, 1L, 3L, 1L, 4L, 2L, 5L, 1L, 1L, 1L), max_elements = c(4L,
4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L)), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
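So far I have tried using arrange and top_n, but I don't know how to flag the cases where several observations have the same sum and the same count.
df %>%
  group_by(group) %>%
  arrange(-sum, -count) %>%
  top_n(as.integer(max_elements))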
Answer 0 (score: 0)
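Here is my attempt at solving your problem. We can use rleid from the data.table package to create a Rank column. After that, we can use case_when to assign the labels based on the conditions. Note that it is important to sort the rows in the right order before applying this step, because rleid numbers consecutive runs of identical (sum, count) values. It looks like you have already done that. If not, add arrange(group, name, sum, count) as the first pipe operation.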
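library(tidyverse)
library(data.table)  # for rleid()
# dat is the data frame from the question (the structure(...) output above)
dat2 <- dat %>%
  # n = number of names sharing the same sum and count within a group
  group_by(group, sum, count) %>%
  add_count() %>%
  # Rank goes up by one each time sum or count changes (rows must be sorted)
  group_by(group) %>%
  mutate(Rank = rleid(sum, count)) %>%
  mutate(decision = case_when(
    n > 1 & Rank <= max_elements ~ "pick_random",
    Rank > max_elements ~ "not_selected",
    TRUE ~ "selected"
  )) %>%
  ungroup() %>%
  select(group, name, decision, sum, count, max_elements) %>%
  mutate(decision = factor(decision))
dat2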
# # A tibble: 15 x 6
#    group name  decision       sum count max_elements
#    <int> <fct> <fct>        <int> <int>        <int>
#  1     1 aaa   selected         3     2            4
#  2     1 bbb   selected         3     1            4
#  3     1 ccc   pick_random      2     2            4
#  4     1 ddd   pick_random      2     2            4
#  5     1 eee   selected         1     0            4
#  6     2 aaa   selected         3     2            3
#  7     2 bbb   selected         3     1            3
#  8     2 ccc   selected         2     3            3
#  9     2 ddd   not_selected     2     1            3
# 10     3 aaa   selected         3     4            4
# 11     3 bbb   selected         3     2            4
# 12     3 ccc   selected         2     5            4
# 13     3 ddd   pick_random      2     1            4
# 14     3 eee   pick_random      2     1            4
# 15     3 fff   pick_random      2     1            4
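As a possible variation, here is a dplyr-only sketch of the same idea that does not rely on the rows already being sorted. It sorts explicitly by descending sum and count, then builds the rank with match() and unique() instead of rleid. The helper name label_decisions and the columns key and n_tied are my own and do not come from the question. Like the code above, it assumes that a set of tied names occupies a single slot, which is what the expected output for group 1 implies.

```r
library(dplyr)

# Same data as in the question, rebuilt by hand (name kept as character here)
df <- tibble(
  group = rep(1:3, times = c(5, 4, 6)),
  name = c("aaa", "bbb", "ccc", "ddd", "eee",
           "aaa", "bbb", "ccc", "ddd",
           "aaa", "bbb", "ccc", "ddd", "eee", "fff"),
  sum = c(3L, 3L, 2L, 2L, 1L, 3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L),
  count = c(2L, 1L, 2L, 2L, 0L, 2L, 1L, 3L, 1L, 4L, 2L, 5L, 1L, 1L, 1L),
  max_elements = rep(c(4L, 3L, 4L), times = c(5, 4, 6))
)

# Hypothetical helper (the name is an assumption): label each row in its group
label_decisions <- function(data) {
  data %>%
    # Sort explicitly so the result does not depend on the input order
    arrange(group, desc(sum), desc(count)) %>%
    # n_tied = how many names share the same sum and count within a group
    group_by(group, sum, count) %>%
    mutate(n_tied = n()) %>%
    group_by(group) %>%
    mutate(
      key = paste(sum, count),
      # Dense rank: position of each (sum, count) pair among the distinct pairs
      Rank = match(key, unique(key)),
      decision = case_when(
        Rank > max_elements ~ "not_selected",
        n_tied > 1 ~ "pick_random",
        TRUE ~ "selected"
      )
    ) %>%
    ungroup() %>%
    select(group, name, decision, sum, count, max_elements)
}

result <- label_decisions(df)
result
```

Because the function sorts the rows itself, passing the rows in a different order (for example label_decisions(df[15:1, ])) should give the same decisions; only the order of tied names within a group may differ.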