Question

I have a data set with hundreds of columns, and I want to keep the top 20 columns with the highest average (it could also be another aggregation such as sum or SD). How can I do this efficiently? One way I can think of is to create a vector of the averages of all columns, sort it in descending order, keep the top n values, and then use it to subset my data set. I am looking for a more elegant way, ideally something that can also be part of a dplyr pipe (%>%) flow.

The code below creates a dummy data set. I would also appreciate suggestions for more elegant ways to create one.

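# Initialize the data set with five columns
set.seed(101)
df <- as.data.frame(matrix(round(runif(25, 2, 5), 0), nrow = 5, ncol = 5))

# Add five more blocks of five columns each
for (i in 1:5) {
  # Resetting the seed makes every block reuse the same underlying draws, rescaled
  set.seed(101)
  df_stage <-
    as.data.frame(matrix(
      round(runif(25, 5 * i, 10 * i), 0), nrow = 5, ncol = 5
    ))
  # paste0() avoids the space that paste() would put in the names ("v 10")
  colnames(df_stage) <- paste0("v", (10 * i):(10 * i + 4))
  df <- cbind(df, df_stage)
}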
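
On the request for a tidier way to build the dummy data, here is a minimal sketch in base R that builds the blocks with lapply() and binds them in one step. It is not the asker's exact data: it assumes the seed is set once (so the blocks are not rescaled copies of each other), it omits the initial five-column block, and the names n_rows, blocks and df_list are made up for illustration.

```r
set.seed(101)

n_rows <- 5
blocks <- 1:5  # one block of five columns per value of i (illustrative)

# Each block i holds five columns drawn from Uniform(5 * i, 10 * i), rounded.
df_list <- lapply(blocks, function(i) {
  m <- matrix(round(runif(n_rows * 5, 5 * i, 10 * i)), nrow = n_rows)
  colnames(m) <- paste0("v", (10 * i):(10 * i + 4))
  m
})

# Bind all blocks column-wise and convert once at the end.
df <- as.data.frame(do.call(cbind, df_list))
```

Binding once at the end avoids growing the data frame with cbind() inside a loop, which tends to matter more as the number of blocks increases.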

Answer 1

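We can do this by computing the column means, ordering them in decreasing order, and indexing df with the first n positions: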

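library(dplyr)
n <- 3
df %>% 
  summarise_all(mean) %>%            # one-row data frame of column means
  unlist() %>%                       # named numeric vector
  order(., decreasing = TRUE) %>%    # column positions, largest mean first
  head(n) %>%                        # keep the first n positions
  df[.]                              # subset the original columns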
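
Since the question also mentions other aggregations such as sum or SD, the same idea can be wrapped in a small function. The following is only a sketch in base R: top_n_cols is a hypothetical helper name, the example data frame is made up, and it assumes every column is numeric and that fun returns a single number per column.

```r
# Hypothetical helper: keep the n columns with the largest value of `fun`.
top_n_cols <- function(data, n, fun = mean, ...) {
  # One score per column; vapply() errors if fun does not return a single number.
  scores <- vapply(data, fun, numeric(1), ...)
  # Positions of the n largest scores (or all columns if n exceeds ncol).
  keep <- order(scores, decreasing = TRUE)[seq_len(min(n, length(scores)))]
  data[keep]
}

# Small illustrative data frame (not the asker's data).
df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(0, 5, 13),
  d = c(7, 8, 9)
)

by_mean <- top_n_cols(df, 2)       # highest means: b, then d
by_sd   <- top_n_cols(df, 2, sd)   # highest standard deviations: b, then c
```

Because the data frame is the first argument, a call such as df %>% top_n_cols(20, sd) should also fit into a pipe.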

Answer 2

Another tidyverse approach with a bit of reshaping:

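library(tidyverse)

n <- 3

df %>% 
  summarise_all(mean) %>%   # one-row data frame of column means
  gather() %>%              # long format: columns `key` and `value`
  top_n(n, value) %>%       # rows with the n largest means (ties are kept)
  pull(key) %>%             # names of the selected columns
  df[.]                     # subset the original columns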
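
In current tidyverse releases, summarise_all(), gather() and top_n() are marked as superseded. A rough equivalent with their replacements might look like the sketch below. It assumes dplyr 1.0.0 or later and tidyr 1.0.0 or later; the example data frame and the name top_cols are made up for illustration, and with_ties = FALSE is a choice made here so that exactly n columns come back.

```r
library(dplyr)
library(tidyr)

# Small illustrative data frame (not the asker's data).
df <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  c = c(5, 5, 5),
  d = c(7, 8, 9)
)
n <- 2

top_cols <- df %>%
  summarise(across(everything(), mean)) %>%                           # column means
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  slice_max(value, n = n, with_ties = FALSE) %>%                      # n largest, sorted
  pull(key)

# Keep only the selected columns, ordered from highest to lowest mean.
result <- df %>% select(all_of(top_cols))
```

Swapping mean for sum or sd inside across() changes the aggregation without touching the rest of the pipeline.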

Select top n columns (based on an aggregation)
