Question

解释起来有点棘手，我会尽我所能，在下面查询。我有一个 df 如下。我需要根据国家/地区列中的最大流行率按组过滤行，但在上述组中尚未发生。（根据输出（图片），A没有出现在group2中的原因是因为它已经出现在了group 1中）

简而言之，我需要在 country 列中获取唯一值，同时在 pop 中获取最大值（在组级别）。我希望图片可以传达我无法传达的内容。（首选 Tidyverse 解决方案）

[![预期输出][2]][2]

df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 
100L)), class = "data.frame", row.names = c(NA, -9L))

Answer 1

我认为这行。语法说明

将数据拆分为每个组的列表
离开第一组（因为它将在下一步中用作 .init，但在过滤 pop 的最大值之后。
在此处使用 purrr::reduce 将小标题列表减少到单个小标题
reduce 中使用的迭代
- .init 用作过滤的第一组
- 此后通过 anti_join 删除之前组中的国家/地区
- 此数据再次过滤最大 pop
- 添加了 bind_rows() 之前过滤的国家
因此，最终我们将获得所需的 tibble。

df %>% group_split(Group) %>% .[-1] %>%
  reduce(.init =df %>% group_split(Group) %>% .[[1]] %>% 
               filter(pop == max(pop)), 
             ~ .y  %>%
               anti_join(.x, by = c("country" = "country")) %>% 
               filter(pop == max(pop)) %>%
               bind_rows(.x) %>% arrange(Group)) 

# A tibble: 3 x 3
  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 G         100

Answer 2

您可以创建一个辅助函数，将每个组的最大弹出量写入向量中，并使用它来过滤数据框。

library(tidyverse)
max_values <- c()

helper <- function(dat, ...){
  dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
  max_value <- max(dat) # get current max. value
  max_values <<- c(max_values, max_value) # append 
  return(max_value)
}

df %>% 
  group_by(Group) %>% 
  filter(pop == helper(pop))

这给了你：

# A tibble: 3 x 3
# Groups:   Group [3]
  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 H         120

使用的数据：

> df
   Group country pop
1      1       A 200
2      1       B 100
3      1       C  50
4      2       A 200
5      2       E 150
6      2       F 120
7      3       A 200
8      3       E 150
9      3       G 100
10     3       H 120

Answer 3

这是另一种可能性，但是 过于简化，因为它没有考虑一个群体在一个群体中拥有更多人口的可能性它没有赢。

library(dplyr)
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 
100L)), class = "data.frame", row.names = c(NA, -9L))

df %>% 
  group_by(country) %>% 
  summarize(popmax = max(pop))  %>% 
  inner_join(df, by = c("popmax" = 'pop')) %>% 
  rename(country = country.y) %>% 
  select(-country.x) %>% 
  group_by(country) %>% 
  arrange(Group) %>% 
  slice(1) %>% 
  ungroup() %>% 
  group_by(Group) %>% 
  arrange(country) %>% 
  slice(1) %>%  
  select(Group, country, popmax) %>% 
  rename(pop = popmax)

我的答案失败了（而其他答案没有）这个数据集：

df <- tribble(
  ~Group, ~ country, ~pop,
  1     ,         'A',    200,
  1     ,         'B',    100,
  1     ,         'C',     50,
  1     ,         'G',    150,
  2     ,         'A',    200,
  2     ,         'E',    150,
  2     ,         'F',    120,
  3     ,         'A',    200,
  3     ,         'E',    150,
  3     ,         'G',    100
)

Answer 4

更新 @Crestor 声称我的答案不正确。

我的回答是正确的，因为我的代码按照 OP 的要求提供了所需的输出。
您反对我的代码在其他情况下不起作用的观点可能是正确的，但在这种情况下它无关紧要，因为我的回答只是为了解决手头的任务。
以下是针对此数据集提出的场景的答案：

df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), 
    country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), 
    pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
    )), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))

Crestor 的预期输出：

# A tibble: 3 x 3
  Group country   pop
  <int> <chr>   <int>
1     1 C         250
2     2 A         220
3     3 E         150

我为你的场景@crestor 编写的代码

library(dplyr)

df1 %>% 
group_by(country) %>% 
arrange(Group) %>% 
filter(pop == max(pop)) %>% 
group_by(Group) %>% 
filter(pop == max(pop))

输出：

  Group country   pop
  <int> <chr>   <int>
1     1 C         250
2     2 A         220
3     3 E         150

OP 对问题的原始回答

为简单起见：首先arrange使您的数据集就位。然后 group_by 并用 slice 保留每组中的第一行。然后 group_by Group 和 filter 最大 pop

library(dplyr)
df %>% 
  arrange(country, pop) %>% 
  group_by(country) %>% 
  slice(1) %>% 
  group_by(Group) %>% 
  filter(pop==max(pop))

输出：

  Group country   pop
  <int> <chr>   <int>
1     1 A         200
2     2 E         150
3     3 G         100

根据一列中的最大值和唯一值过滤行

4 个答案: