How to delete zero-sum columns from a list of files

Time: 2016-07-13 07:39:25

Tags: r automation pca

Below is the code for applying PCA to multiple data frames stored in myfiles.

## Get file names for a working directory ###
temp = list.files(pattern="*.csv")

## Read files ###
myfiles = lapply(temp, read.csv)

### Name the files ###

names(myfiles)<-c("mCRC_2015_Q1","mCRC_2015_Q2","mCRC_2015_Q3","mCRC_2015_Q4")

##### to check the names of the columns #######
names(myfiles$mCRC_2015_Q1)

##### to change the names of the columns ######

colnames = c("Insufficient efficacy","Issues around safety/tolerability","Inconvenient dosage regimen/administration","Price issues","Not reimbursed","Not included on hospital/government medicines formulary","Insufficient clinical data available for acceptance","Previously used for this patient","Prescription only possible in selected cases with detailed justification to authorities / payers ","I don’t have enough scientific information about it","Lack of experience in this setting","Involved in clinical trial with other drugs","Patient not appropriate for Targeted therapy","Patient not appropriate for cetuximab (Erbitux)","Others","Country") 


for (i in seq_along(myfiles)){
  colnames(myfiles[[i]]) <- colnames
}

##### Delete all those columns which have zero sum from each dataframe #####
for(i in 1:length(myfiles)){

  myfiles[[i]] <- myfiles[[i]][,which(!lapply(myfiles,FUN = function(x){colSums(x!=0)>0}))]

}

####### Run PCA for each dataframe country wise ####
Myfiles<- split(myfiles, myfiles$Country)
for(i in 1:length(Myfiles)){
  assign(paste0("pca", i), prcomp(Myfiles[[i]][which(names(myfiles)!="Country")], center=T, scale.=T))
}

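These are the problems I am facing:
1) How do I delete all the columns that contain only zero values?
2) How can I apply the prcomp command to each data frame by country (country is one of the variables in the data frame)?
3) From the loading matrix, how do I get the 4 most relevant variables for each data frame (regardless of sign)?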

2 Answers:

Answer 0: (score: 0)

My answer to question 1, how to delete the columns of a data.frame that contain only zero values:

exampledat <- data.frame( zero = rep(0,20), one= rep(1, 20), 
                       two = rep(1, 20),
                       mixed = c(0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                       zeroagain = rep(0,20))

del_zero_cols <- function(datafr){
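  # Keep columns with at least one non-zero value; drop = FALSE keeps a
  # data.frame even when only one column survives
  datafr[, apply(datafr, 2, function(value) any(value != 0, na.rm = TRUE)), drop = FALSE]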
}

del_zero_cols(exampledat)
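
Since the question deals with a list of data frames, a minimal sketch of applying the same idea to every element with lapply may help. The helper name drop_zero_cols and the small myfiles list are hypothetical stand-ins for the real CSV data; non-numeric columns such as Country are assumed to be kept as they are.

```r
# Hypothetical stand-in for the list of data frames read from the CSV files
myfiles <- list(
  q1 = data.frame(a = c(0, 0, 0), b = c(1, 0, 2), Country = c("A", "B", "C")),
  q2 = data.frame(a = c(0, 3, 0), b = c(0, 0, 0), Country = c("A", "B", "C"))
)

# Hypothetical helper: keep non-numeric columns and every numeric column
# that has at least one non-zero value
drop_zero_cols <- function(df) {
  keep <- vapply(
    df,
    function(col) !is.numeric(col) || any(col != 0, na.rm = TRUE),
    logical(1)
  )
  df[, keep, drop = FALSE]
}

# Apply the helper to each data frame; the result is again a named list
cleaned <- lapply(myfiles, drop_zero_cols)
```

Each data frame is filtered independently here, so different files can end up with different sets of columns.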

Answer 1: (score: 0)

There are several options. I would do something like this to handle the values:

myfiles <- 
  list(
    q1 = data.frame(r_1 = rep(0, times = 3), r_2  = c(0,0,1), country = c(LETTERS[1:3])),
    q2 = data.frame(r_1 = rep(0, times = 3), r_2  = c(0,0,0), country = c(LETTERS[1:3])),
    q3 = data.frame(r_1 = rep(0, times = 3), r_2  = c(1,0,0), country = c(LETTERS[1:3]))
  )

# Merge the dataframes into one
merged_myfiles = do.call(rbind, myfiles)
merged_myfiles$file = gsub("\\.[0-9]+$", "", rownames(merged_myfiles))

# Clean columns that are all 0
cleaned_data = merged_myfiles[,!sapply(merged_myfiles, function(col) all(col == 0))]

# by() is a neat base function that applies a function to subsets of a data frame;
# the output is a list-like object of class "by"
by(cleaned_data, cleaned_data$country, function(df){
  mean(df$r_2)
})

# Used dplyr for grouping analyses
library(dplyr)
library(magrittr)
cleaned_data %>% 
  group_by(country) %>% 
  do({
    data.frame(mean = mean(.$r_2), file = .$file[1])
  })

The by option gives you:

cleaned_data$country: A
[1] 0.3333333
------------------------------------------------------------------------------------------------------------------------------------------ 
cleaned_data$country: B
[1] 0
------------------------------------------------------------------------------------------------------------------------------------------ 
cleaned_data$country: C
[1] 0.3333333

While dplyr gives:

Source: local data frame [3 x 3]
Groups: country [3]

  country      mean   file
   <fctr>     <dbl> <fctr>
1       A 0.3333333     q1
2       B 0.0000000     q1
3       C 0.3333333     q1
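
The same split-by-group idea could be sketched for the PCA itself (question 2). This is only an illustration under stated assumptions: the data frame df, its columns x, y, z and the Country labels are made up, and constant columns are dropped within each country because prcomp cannot rescale them to unit variance.

```r
# Hypothetical data: three numeric variables plus a country label
set.seed(1)
df <- data.frame(
  x = rnorm(12),
  y = rnorm(12),
  z = rnorm(12),
  Country = rep(c("A", "B", "C"), each = 4)
)

# Split the numeric columns by country; the result is a named list
by_country <- split(df[, names(df) != "Country"], df$Country)

# Run one PCA per country and keep the results in a list instead of
# creating pca1, pca2, ... with assign()
pca_list <- lapply(by_country, function(d) {
  # Drop columns that are constant within this country
  d <- d[, vapply(d, function(col) stats::var(col) > 0, logical(1)), drop = FALSE]
  prcomp(d, center = TRUE, scale. = TRUE)
})
```

Keeping the fits in a named list means a single country can be inspected with, for example, pca_list$A$rotation.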

To select the largest prcomp loadings, I suggest the following approach:

prcomp(USArrests) %>% 
  extract("rotation") %>% 
  unlist() %>% 
  abs() %>% 
  order(decreasing = TRUE) %>% 
  extract(1:4) %>% 
  data.frame(row = (. - 1) %% ncol(USArrests) + 1,
             col = ceiling(. / ncol(USArrests)))

The original prcomp(USArrests)$rotation looks like this:

                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

And the output of the magrittr pipe shows exactly the variables of interest:

   . row col
1  2   2   1
2 13   1   4
3  7   3   2
4 12   4   3
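
As an alternative sketch in base R, arrayInd can translate the linear positions into row and column indices, which can then be mapped to variable and component names. The object names rot, top, idx and result are chosen here purely for illustration.

```r
# Loading matrix of the PCA on the built-in USArrests data
rot <- prcomp(USArrests)$rotation

# Linear (column-major) positions of the four largest absolute loadings
top <- order(abs(rot), decreasing = TRUE)[1:4]

# Convert linear positions into (row, column) pairs
idx <- arrayInd(top, dim(rot))

# Report names instead of bare indices
result <- data.frame(
  variable  = rownames(rot)[idx[, 1]],
  component = colnames(rot)[idx[, 2]],
  loading   = rot[top]
)
result
```

To rank variables within a single component instead, the same ordering could be applied to one column, for example order(abs(rot[, "PC1"]), decreasing = TRUE).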