I split a dataframe to create a list of 401 dataframes. Every dataframe in the list has the same structure (same columns), but potentially a different number of rows.
When I split the dataframe, I introduced 0 variance columns (colSums=0). Dataframes in the list may share 0 variance columns, or they may have totally different columns with 0 variance.
I have used the following function (from Quickly remove zero variance variables from a data.frame) to remove 0 variance columns from each dataset:
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out==1)
}
When I pass my data frame list to the function (ignoring the first two character columns of dataframe_list):
dataframe_list_zero_var_rm<-lapply(dataframe_list, function(d) d[,-zeroVar(d[,3:ncol(d)], useNA = 'no')])
No errors/flags are thrown.
However, while dataframes in dataframe_list_zero_var_rm have fewer columns than they do in dataframe_list, they still have columns that have zero variance, as revealed by:
zeroVar(dataframe_list_zero_var_rm[[1]][,3:ncol(dataframe_list_zero_var_rm[[1]])], useNA = 'no')
Passing the new dataframe to the original function shows me three columns with 0 variance which should have been removed in the first place.
This is a problem for me because I am trying to do principal components analysis on every dataframe in the list, but the zero variance columns become problematic for prcomp().
My ideal solution would be a way to
Answer 0 (score: 1)
You can use this approach with data.table:
library(data.table)
lapply(df_list,setDT) #convert all of your data.frames to data.tables
all_pos_var <- lapply(df_list, function(dt) {
  dt[, unlist(dt[, lapply(names(dt)[3:ncol(dt)],
                          function(x) {
                            if (diff(range(get(x))) != 0) x
                          })]),
     with=F]
})
The inner lapply gets the names of all non-0-variance (equivalent to non-0-range) columns: lapply(names(dt),function(x)if(diff(range(get(x)))!=0)x).
The outer lapply applies this procedure to all of your data.frames/data.tables.
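For comparison, the same range-based test can be sketched in base R without data.table. This is only an illustration: the helper name keep_varying and the toy list are hypothetical, and it assumes every column from the third onward is numeric with no NA values. Like the data.table version, it returns only the varying columns among those tested.

```r
# Hypothetical helper: keep only the columns (from position `first` onward)
# whose range is non-zero, i.e. whose values are not all identical.
keep_varying <- function(d, first = 3) {
  cols <- names(d)[first:ncol(d)]
  varies <- vapply(cols, function(x) diff(range(d[[x]])) != 0, logical(1))
  d[, cols[varies], drop = FALSE]
}

# Toy list (not the asker's data): two frames with different constant columns.
toy <- list(
  data.frame(ig1 = 1:4, ig2 = 4:1, zv1 = rep(1, 4), nzv2 = c(0.1, 0.5, 0.2, 0.9)),
  data.frame(ig1 = 1:4, ig2 = 4:1, nzv1 = c(3, 1, 4, 1), zv2 = rep(7, 4))
)

toy_filtered <- lapply(toy, keep_varying)
lapply(toy_filtered, names)
```

Using drop = FALSE keeps the result a data.frame even when only one column survives.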
Test data:
set.seed(101)
dt1<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
zv1=rep(1,10),nzv2=runif(10),
zv3=rep(2,10),nzv4=runif(10))
dt2<-data.frame(ig1=rnorm(10),ig2=rnorm(10),
zv1=rep(3,10),nzv2=rnorm(10),
zv3=rep(4,10),nzv4=rnorm(10),
zv5=rep(5,10),nzv6=rnorm(10))
df_list<-list(dt1,dt2)
Only nzv* variables should be returned; indeed:
> lapply(all_pos_var,names)
[[1]]
[1] "nzv2" "nzv4"
[[2]]
[1] "nzv2" "nzv4" "nzv6"
On trying to wrap your head around the double lapply:
First, try to understand what the inner lapply is doing by focusing on a single data.frame:
setDT(dt1)
rel_cols<-names(dt1)[3:ncol(dt1)]
The inner lapply is:
nzcols<-dt1[,lapply(rel_cols,function(x)if(diff(range(get(x)))!=0)x)]
> nzcols
V1 V2
1: nzv2 nzv4
The unlist part converts nzcols to a character vector, which can then be used to subset dt1 (note that we need to use the parameter with=F when passing quoted column names to a data.table):
> dt1[,unlist(nzcols),with=F]
nzv2 nzv4
1: 0.43496175 0.07921225
2: 0.44205468 0.43388945
3: 0.76068946 0.67977425
4: 0.33296130 0.73435624
5: 0.39435715 0.45251087
6: 0.23329428 0.78378572
7: 0.07160766 0.67983554
8: 0.91338349 0.51870365
9: 0.77169357 0.69080575
10: 0.10753664 0.58827565
The outer lapply simply applies this procedure to all of the data.tables in df_list.
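As for why the original attempt left zero-variance columns behind: zeroVar() was called on d[,3:ncol(d)], so the positions it returns are relative to that subset, but they were then used to index the full d. Every index is therefore off by two, and the wrong columns get dropped. A minimal base R sketch of a fix follows; the wrapper name drop_zero_var, its skip argument, and the toy data are hypothetical. It also guards against the case where no column has zero variance, because d[,-integer(0)] would otherwise drop every column.

```r
# Same helper as in the question: positions of columns with one distinct value.
zeroVar <- function(data, useNA = "ifany") {
  out <- apply(data, 2, function(x) length(table(x, useNA = useNA)))
  which(out == 1)
}

# Hypothetical wrapper: ignore the first `skip` columns when testing,
# then shift the positions back so they refer to the full data frame.
drop_zero_var <- function(d, skip = 2) {
  idx <- zeroVar(d[, (skip + 1):ncol(d), drop = FALSE], useNA = "no") + skip
  if (length(idx) == 0) return(d)  # nothing to drop
  d[, -idx, drop = FALSE]
}

# Toy data (not the asker's): two character columns, then numeric columns.
set.seed(101)
d <- data.frame(id1 = letters[1:10], id2 = LETTERS[1:10],
                zv1 = rep(1, 10), nzv2 = runif(10),
                zv3 = rep(2, 10), nzv4 = runif(10))

cleaned <- drop_zero_var(d)
names(cleaned)
```

With a list of data frames, the same wrapper could be applied as lapply(dataframe_list, drop_zero_var).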