Question

I have a data.frame that I want to summarise to get the 5 highest and 5 lowest values of each column. I am using iris as a reproducible example.

The 5 highest values of every variable in iris can be obtained with:

library(dplyr)  # provides %>%, top_n() and desc(), which are used unqualified below

df_h <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = 5) %>% 
  dplyr::mutate(category = "high")

For the 5 lowest values I use the same code, except with top_n(n = -5).

df_l <-  iris %>% 
  dplyr::select(Species, everything()) %>% 
  tidyr::gather("id", "value", 2:5) %>% 
  dplyr::arrange(Species, id, desc(value)) %>% 
  dplyr::group_by(Species, id ) %>% 
  top_n(n = -5) %>% 
  dplyr::mutate(category = "low")

Then I combine the two data.frames, df_h (the 5 highest values) and df_l (the 5 lowest values).

df_fin <-  df_h %>% bind_rows(., df_l)

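I am looking for a more efficient or shorter way to get the same result without creating two data.frames and binding them together. Any suggestions would be appreciated.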

Answer 1

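If you only want to extract the extreme values, you can merge the two top_n calls into a single compound condition inside filter (note that top_n is just a shortcut for a filter that uses min_rank):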

    library(tidyverse)

    iris %>% 
          gather("dims", "value", -Species) %>%
          group_by(Species, dims) %>%
          filter( min_rank(desc(value)) <= 5 | 
                    min_rank(value) <= 5 ) -> df_hi_lo

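However, this does not include the high/low category.

A more flexible solution is to use a function that returns one of these category names, or an empty string: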

hilo <- function(x, n) {
  hi_rk <- min_rank(desc(x))  # change rank function as needed
  lo_rk <- min_rank(x)
  # "high" for the n largest, "low" for the n smallest, "" otherwise
  paste0(ifelse(hi_rk <= n, "high", ""),
         ifelse(lo_rk <= n, "low", ""))
}

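I used the min_rank function here, which replicates the behaviour of top_n, but you should also consider replacing it with dense_rank.

This lets you add the category to every row and then filter down to the high/low values: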
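To make the difference between the two ranking functions concrete, here is a small sketch on a made-up vector with one tie (the vector `x` is purely illustrative and not part of the original answer):

```r
library(dplyr)

# Illustrative data: the value 20 appears twice
x <- c(10, 20, 20, 30, 40)

min_rank(x)    # 1 2 2 4 5: tied values share a rank and leave a gap after them
dense_rank(x)  # 1 2 2 3 4: tied values share a rank and no gap follows

# The same cutoff therefore keeps a different number of values
x[min_rank(x) <= 3]    # 10 20 20
x[dense_rank(x) <= 3]  # 10 20 20 30
```

With min_rank the cutoff counts rows, while with dense_rank it counts distinct values, so which one is appropriate depends on how you want ties handled.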

iris %>% gather("dims", "value", -Species) %>%
  group_by(Species, dims) %>%
  mutate(category = hilo(value, 5) ) %>%
  filter(category != "") -> df_hl
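One way to sanity-check the result is to count the rows in each group and category. The sketch below repeats the pipeline so that it runs on its own; the `counts` object is an illustrative name, not part of the original answer. Because min_rank gives tied values the same rank, a group can contain more than 5 rows when there are ties at the cutoff.

```r
library(dplyr)
library(tidyr)

# Same helper as in the answer: label the n highest and n lowest values
hilo <- function(x, n) {
  hi_rk <- min_rank(desc(x))
  lo_rk <- min_rank(x)
  paste0(ifelse(hi_rk <= n, "high", ""),
         ifelse(lo_rk <= n, "low", ""))
}

df_hl <- iris %>%
  gather("dims", "value", -Species) %>%
  group_by(Species, dims) %>%
  mutate(category = hilo(value, 5)) %>%
  filter(category != "")

# One row per Species / dims / category with the number of values kept
counts <- df_hl %>%
  ungroup() %>%
  count(Species, dims, category)

counts
```

If exactly 5 rows per group are required, a rank function that breaks ties, such as row_number, might be a better fit than min_rank.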

Answer 2

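If I understand your question correctly, I think you can use a ranking function to do this programmatically in a single data frame.

I put together an example using the iris dataset (below). Basically, it does three things beyond your initial starting point: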

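1. Creates a temporary variable called rank that gives each value's position within its column. I used dense_rank from dplyr as a proof of concept, but depending on your purpose you may need a different ranking function.
2. Computes high_category_lower_bound, which is the rank of the fifth-highest value in the group. It also sets low_category_upper_bound, which is useful if you want to control it programmatically, but it is hard-coded to 5 in this example.
3. Filters the data frame to keep only values whose rank is less than or equal to low_category_upper_bound or greater than or equal to high_category_lower_bound, then uses ifelse to create the category.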

Notes:

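I also force distinct on each column/value pair. I did this to keep the solution simpler. If you want non-distinct values, you may need to adjust the ranking function: dense_rank gives tied values the same rank, so without the distinct step they would be returned multiple times.
I replaced 2:5 with 1:ncol(.), but your mileage may vary depending on the exact dataset you are actually working with.
If you care about why a record was included in the result, you can keep some of the temporary columns that I create and then drop from the final result.
Depending on the scale of the dataset you are actually working with, this solution may be more or less efficient than you would like. Its downside is that it has to rank all values, which can be expensive on large datasets. Personally, I would try the solution and see whether it meets your needs, but keep that caveat in mind. On the iris dataset it returns quickly. On a few million rows it may take longer.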

library(dplyr)
library(tidyr)

df_all <- 
    iris %>%
    # gather all columns
    gather("column", "value", 1:ncol(.)) %>% 
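    # gathering numeric and factor columns together turns value into
    # character, so convert it back to numeric and keep only values
    # which can be evaluated as high/low; you could expand this to
    # include factor variables, but that's beyond the scope of this
    # question and you'd have to redefine the factor levels first
    mutate(value = suppressWarnings(as.numeric(value))) %>%
    filter(!is.na(value)) %>%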
    # get distinct values - optional but probably helpful
    distinct(column, value) %>%
    # group by column so that ranks are computed within each column
    group_by(column) %>%
    # create ranking sequence
    mutate(
        rank = dense_rank(value),
        low_category_upper_bound = 5,
        high_rank = max(rank),
        high_category_lower_bound = high_rank - 4 
    ) %>%
    # retain only top and bottom values
    # filter and create category label
    filter(
        rank <= low_category_upper_bound | 
        rank >= high_category_lower_bound
    ) %>%
    mutate(
        category = ifelse(rank >= high_category_lower_bound, "high", ""),
        category = ifelse(rank <= low_category_upper_bound, "low", category)
    ) %>%
    # select columns of interest
    select(column, value, category)
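Since the bounds are hard-coded to 5 above, here is a minimal sketch of how the same idea might be wrapped in a function with the cutoff as a parameter. The function name `top_bottom_distinct` and the argument `n` are hypothetical, and the sketch assumes a dplyr version that supports `where()` inside `select()` (1.0.0 or later). Selecting the numeric columns up front avoids the character conversion that occurs when a factor column is gathered along with the numeric ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep the n lowest and n highest distinct values per column
top_bottom_distinct <- function(df, n = 5) {
  df %>%
    # keep only numeric columns so that value stays numeric after gathering
    select(where(is.numeric)) %>%
    gather("column", "value") %>%
    distinct(column, value) %>%
    group_by(column) %>%
    mutate(rank = dense_rank(value),
           high_rank = max(rank)) %>%
    # same bounds as above, with 5 replaced by n
    filter(rank <= n | rank > high_rank - n) %>%
    mutate(category = ifelse(rank <= n, "low", "high")) %>%
    ungroup() %>%
    select(column, value, category)
}

res <- top_bottom_distinct(iris, n = 3)
res
```

As in the original code, a value that qualifies for both ends (possible only when a column has fewer than 2 * n distinct values) is labelled "low".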

dplyr: summarise a data.frame to get the highest and lowest values

2 answers: