R: Removing zero variance columns from each element of dataframe list

Time: 2015-07-28 15:40:40

Tags: r list dataframe

I split a dataframe to create a list of 401 dataframes. Each dataframe in the list is identical in structure (same columns) but may have a different number of rows.

When I split the dataframe, I introduced 0 variance columns (colSums=0). Dataframes in the list may share 0 variance columns, or they may have totally different columns with 0 variance.

I have used the following function (from Quickly remove zero variance variables from a data.frame) to remove 0 variance columns from each dataset:

zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out==1)
}

When I pass my data frame list to the function (ignoring the first two character columns of dataframe_list):

dataframe_list_zero_var_rm<-lapply(dataframe_list, function(d) d[,-zeroVar(d[,3:ncol(d)], useNA = 'no')])

No errors/flags are thrown.

However, while dataframes in dataframe_list_zero_var_rm have fewer columns than they do in dataframe_list, they still have columns that have zero variance, as revealed by:

zeroVar(dataframe_list_zero_var_rm[[1]][,3:ncol(dataframe_list_zero_var_rm[[1]])], useNA = 'no')

Passing the new dataframe to the original function shows me three columns with 0 variance which should have been removed in the first place.

This is a problem for me because I am trying to do principal components analysis on every dataframe in the list, but the zero variance columns become problematic for prcomp().

My ideal solution would be a way to

  • loop through each element of the dataframe list and remove columns from each dataframe that have zero variance
  • then, loop through each element of the dataframe list and perform prcomp() on the dataframe

1 answer:

Answer 0 (score: 1)

You can use this approach from data.table:

library(data.table)
lapply(df_list,setDT) #convert all of your data.frames to data.tables

all_pos_var<-
  lapply(df_list,function(dt){
    dt[,unlist(dt[,lapply(names(dt)[3:ncol(dt)],
                          function(x){
                            if(diff(range(get(x)))!=0)x})]),
       with=FALSE]})

The inner lapply returns the names of all columns with non-zero variance (equivalent to non-zero range), skipping the first two columns: lapply(names(dt)[3:ncol(dt)],function(x)if(diff(range(get(x)))!=0)x).

The outer lapply applies this procedure to all of your data.frame/data.tables.
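
The range test that the inner lapply relies on can be checked on single vectors. The following is only a sketch: range_is_zero is a hypothetical helper name, and the na.rm = TRUE argument is an addition that the code above does not use. Without it, a column containing NA makes diff(range(x)) evaluate to NA, which if() cannot handle.

```r
x_const <- rep(3, 10)      # constant column: zero variance
x_var   <- c(1, 2, 2, 5)   # non-constant column
x_na    <- c(1, NA, 3)     # non-constant column with a missing value

# Hypothetical helper: TRUE when all non-missing values are identical
range_is_zero <- function(x) diff(range(x, na.rm = TRUE)) == 0

range_is_zero(x_const)     # TRUE, and var(x_const) is 0 as well
range_is_zero(x_var)       # FALSE
range_is_zero(x_na)        # FALSE
is.na(diff(range(x_na)))   # TRUE: without na.rm the test itself is NA
```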

Test data:

set.seed(101)
dt1<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
                zv1=rep(1,10),nzv2=runif(10),
                zv3=rep(2,10),nzv4=runif(10))
dt2<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
                zv1=rep(3,10),nzv2=rnorm(10),
                zv3=rep(4,10),nzv4=rnorm(10),
                zv5=rep(5,10),nzv6=rnorm(10))
df_list<-list(dt1,dt2)

Only nzv* variables should be returned; indeed:

> lapply(all_pos_var,names)
[[1]]
[1] "nzv2" "nzv4"

[[2]]
[1] "nzv2" "nzv4" "nzv6"
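
A possible explanation for the behaviour in the question (an inference from the code shown there, not something either post states): zeroVar() is called on d[,3:ncol(d)], so the positions it returns are counted from the third column, while d[,-...] reads them as positions in the full data frame. The indices are therefore off by two. In addition, if no column is constant, negative indexing with an empty vector drops every column. Below is a minimal base R sketch that keeps the first two columns and removes constant columns from the rest. drop_zero_var and n_keep are hypothetical names, and the sample data is made up for illustration.

```r
# Hypothetical helper: drop constant columns, always keeping the first
# n_keep identifier columns untouched.
drop_zero_var <- function(d, n_keep = 2) {
  d <- as.data.frame(d)
  # positions in the FULL data frame, so no offset correction is needed later
  candidates <- setdiff(seq_len(ncol(d)), seq_len(n_keep))
  is_const <- vapply(d[candidates],
                     function(x) length(unique(x[!is.na(x)])) <= 1,
                     logical(1))
  # positive indexing stays safe even when nothing is constant
  d[, c(seq_len(n_keep), candidates[!is_const]), drop = FALSE]
}

set.seed(101)
d <- data.frame(id1 = letters[1:10], id2 = LETTERS[1:10],
                zv1 = rep(1, 10), nzv2 = runif(10),
                zv3 = rep(2, 10), nzv4 = runif(10))
cleaned <- lapply(list(d, d[1:5, ]), drop_zero_var)
lapply(cleaned, names)   # each element keeps id1, id2, nzv2, nzv4
```

Unlike all_pos_var above, which returns only the nzv* columns, this variant keeps the two leading columns in the result.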

On trying to wrap your head around double lapply:

First, try to understand what the inner lapply is doing by focusing on a single data.frame:

setDT(dt1)
rel_cols<-names(dt1)[3:ncol(dt1)]

The inner lapply is:

nzcols<-dt1[,lapply(rel_cols,function(x)if(diff(range(get(x)))!=0)x)]
> nzcols
     V1   V2
1: nzv2 nzv4

The unlist part converts nzcols to a character vector, which can then be used to subset dt1 (note that we need to use the parameter with=F when passing quoted column names to a data.table):

> dt1[,unlist(nzcols),with=F]
          nzv2       nzv4
 1: 0.43496175 0.07921225
 2: 0.44205468 0.43388945
 3: 0.76068946 0.67977425
 4: 0.33296130 0.73435624
 5: 0.39435715 0.45251087
 6: 0.23329428 0.78378572
 7: 0.07160766 0.67983554
 8: 0.91338349 0.51870365
 9: 0.77169357 0.69080575
10: 0.10753664 0.58827565

The outer lapply simply applies this procedure to all of the data.tables in df_list.
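
The second step from the question, running prcomp() on every element, is not covered above. One way it might look is sketched here, assuming every remaining column is numeric. cleaned_list is a made-up stand-in for all_pos_var, and pca_list is a hypothetical name.

```r
set.seed(101)
# Stand-in for all_pos_var: data frames holding only non-constant numeric columns
cleaned_list <- list(
  data.frame(nzv2 = runif(10), nzv4 = runif(10)),
  data.frame(nzv2 = rnorm(10), nzv4 = rnorm(10), nzv6 = rnorm(10))
)

# scale. = TRUE is the setting that fails on constant columns,
# which is why they have to be removed first
pca_list <- lapply(cleaned_list,
                   function(d) prcomp(d, center = TRUE, scale. = TRUE))

# proportion of variance explained by each component, per data frame
lapply(pca_list, function(p) p$sdev^2 / sum(p$sdev^2))
```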