Question

I split a dataframe to create a dataframe list. The dataframe list has 401 dataframes. Every dataframe has the same structure (same columns), but potentially a different number of rows.

When I split the dataframe, I introduced 0 variance columns (colSums=0). Dataframes in the list may share 0 variance columns, or they may have totally different columns with 0 variance.

I have used the following function (from Quickly remove zero variance variables from a data.frame) to remove 0 variance columns from each dataset:

zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out==1)
}

When I pass my data frame list to the function (ignoring the first two character columns of dataframe_list):

dataframe_list_zero_var_rm<-lapply(dataframe_list, function(d) d[,-zeroVar(d[,3:ncol(d)], useNA = 'no')])

No errors/flags are thrown.

However, while dataframes in dataframe_list_zero_var_rm have fewer columns than they do in dataframe_list, they still have columns that have zero variance, as revealed by:

zeroVar(dataframe_list_zero_var_rm[[1]][,3:ncol(dataframe_list_zero_var_rm[[1]])], useNA = 'no')

Passing the new dataframe to the original function shows me three columns with 0 variance which should have been removed in the first place.

This is a problem for me because I am trying to do principal components analysis on every dataframe in the list, but the zero variance columns become problematic for prcomp().

My ideal solution would be a way to:

1. loop through each element of the dataframe list and remove columns from each dataframe that have zero variance
2. then loop through each element of the dataframe list and perform prcomp() on the dataframe

Answer 1

You can use this approach from data.table:

library(data.table)
lapply(df_list,setDT) #convert all of your data.frames to data.tables

all_pos_var<-
  lapply(df_list,function(dt){
    # keep only the columns (from the 3rd on) whose range is non-zero
    dt[,unlist(dt[,lapply(names(dt)[3:ncol(dt)],
                          function(x){
      if(diff(range(get(x)))!=0)x})]),with=F]})

The inner lapply gets the names of all non-0-variance (equivalent to non-0-range) columns, skipping the first two: lapply(names(dt)[3:ncol(dt)],function(x)if(diff(range(get(x)))!=0)x).

The outer lapply applies this procedure to all of your data.frame/data.tables.

Test data:

set.seed(101)
dt1<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
                zv1=rep(1,10),nzv2=runif(10),
                zv3=rep(2,10),nzv4=runif(10))
dt2<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
                zv1=rep(3,10),nzv2=rnorm(10),
                zv3=rep(4,10),nzv4=rnorm(10),
                zv5=rep(5,10),nzv6=rnorm(10))
df_list<-list(dt1,dt2)

Only nzv* variables should be returned; indeed:

> lapply(all_pos_var,names)
[[1]]
[1] "nzv2" "nzv4"

[[2]]
[1] "nzv2" "nzv4" "nzv6"
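
For comparison, here is a minimal base R sketch of the same filter. It also illustrates a likely reason the lapply call in the question leaves constant columns behind: zeroVar() is given d[,3:ncol(d)], so the positions it returns count from the third column and would need to be shifted by 2 before being used to index d. The helper name drop_constant_cols, its n_skip argument and the toy data frame d are made up for this illustration, and unlike the data.table code above this sketch keeps the first two columns.

```r
# Toy data: two character columns, one constant column, one varying column
d <- data.frame(id = letters[1:4], grp = c("a", "a", "b", "b"),
                zv = rep(1, 4), nzv = c(0.5, 1.5, 2.5, 3.5))

# Positions are relative to the subset d[, 3:ncol(d)], not to d itself
rel <- which(vapply(d[, 3:ncol(d)],
                    function(x) length(unique(x)) == 1, logical(1)))
names(d)[rel]      # "id": the wrong column
names(d)[rel + 2]  # "zv": the constant column

# Hypothetical helper: drop constant columns, leaving the first n_skip untouched
drop_constant_cols <- function(d, n_skip = 2) {
  d <- as.data.frame(d)
  check <- setdiff(seq_len(ncol(d)), seq_len(n_skip))  # positions within d
  is_const <- vapply(d[check],
                     function(x) length(unique(x[!is.na(x)])) <= 1,
                     logical(1))
  # Selecting the columns to keep avoids d[, -integer(0)], which selects no columns
  d[, c(seq_len(n_skip), check[!is_const]), drop = FALSE]
}

names(drop_constant_cols(d))  # "id" "grp" "nzv"
```

Selecting the columns to keep, rather than negating the positions to drop, also covers the case where a dataframe has no constant columns at all.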

On trying to wrap your head around double lapply:

First, try to understand what the inner lapply is doing by focusing on a single data.frame:

setDT(dt1)
rel_cols<-names(dt1)[3:ncol(dt1)]

The inner lapply is:

nzcols<-dt1[,lapply(rel_cols,function(x)if(diff(range(get(x)))!=0)x)]
> nzcols
     V1   V2
1: nzv2 nzv4

The unlist part converts nzcols to a character vector, which can then be used to subset dt1 (note that we need to use the parameter with=F when passing quoted column names to a data.table):

> dt1[,unlist(nzcols),with=F]
          nzv2       nzv4
 1: 0.43496175 0.07921225
 2: 0.44205468 0.43388945
 3: 0.76068946 0.67977425
 4: 0.33296130 0.73435624
 5: 0.39435715 0.45251087
 6: 0.23329428 0.78378572
 7: 0.07160766 0.67983554
 8: 0.91338349 0.51870365
 9: 0.77169357 0.69080575
10: 0.10753664 0.58827565

The outer lapply simply applies this procedure to all of the data.tables in df_list.
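
The second step the question asks for, running prcomp() on each element, could then be sketched roughly as follows. The list num_list is a made-up stand-in for all_pos_var (only numeric, non-constant columns remain), and center = TRUE, scale. = TRUE is an illustrative choice rather than something the question specifies.

```r
set.seed(101)
# Stand-in for all_pos_var: data frames holding only non-constant numeric columns
num_list <- list(
  data.frame(nzv2 = runif(10), nzv4 = runif(10)),
  data.frame(nzv2 = rnorm(10), nzv4 = rnorm(10), nzv6 = rnorm(10))
)

# One PCA fit per list element; with scale. = TRUE, prcomp() stops on a
# constant column, which is why the filtering step comes first
pca_list <- lapply(num_list, function(d) prcomp(d, center = TRUE, scale. = TRUE))

# Proportion of variance explained by each component, per data frame
var_explained <- lapply(pca_list, function(p) p$sdev^2 / sum(p$sdev^2))
```

Each element of pca_list is an ordinary prcomp object, so summary() or the $rotation and $x components can be inspected per dataframe.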
