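I have a data frame with ~14,000 rows and ~580 columns. Each cell contains an expression value (RNA expression data). I converted every value of the df to a percentage based on the sum of its column.
What I want to do now is exclude every row in which all elements are below 0.005. To be clear, if all elements but one are below 0.005, the row is kept.
I managed to do this by writing two nested loops that iterate over all rows and columns of the data frame, but it takes very long to finish.
Here is my code: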
# Create an empty data frame to collect the rows that meet the criterion.
df <- data.frame(matrix(ncol = ncol(tData2_perc), nrow = 0))
colnames(df) <- colnames(tData2_perc)
passed <- 0
# Start loop. tData2_perc is the data frame containing all the percentage values.
for (i in 1:nrow(tData2_perc)) {
  for (j in 1:ncol(tData2_perc)) {
    if (tData2_perc[i, j] >= 0.005) {  # threshold of 0.005, as stated above
      passed <- 1
    }
  }
  if (passed == 1) {
    df <- rbind(df, tData2_perc[i, ])
  }
  passed <- 0
}
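Is there a more elegant (and computationally faster?) way? I tried using apply but could not find a way to make it work... Thanks!
EDIT: Here is a subset of my data (dput() output):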
structure(list(S002ED2S5MID86 = c(0.00506787330316742,0.000542986425339366,
0.000723981900452489, 0.0191855203619909, 0.00452488687782805,
0, 0, 0, 0, 0), AcBarrieBulk10120130703 = c(0.00729498574543015,
0.000419252054335066, 0.00117390575213819, 0.025071272849237,
0.00721113533456314, 0, 0, 0, 0, 0), PelisserRhizo30520130703 = c(0.0093628088426528,
0.00182054616384915, 0.00182054616384915, 0.0280884265279584,
0.00572171651495449, 0, 0, 0, 0, 0), S002F76S3MID96 = c(0.000578452639190166,
0.000144613159797542, 0.00101229211858279, 0.0190889370932755,
0.00289226319595083, 0, 0.000144613159797542, 0, 0.000144613159797542,
0), S002ED0S3MID102 = c(0.249181043896047, 0.0437504549756133,
0.118293659459853, 0.0249690616582951, 0.0470990754895538, 0,
0, 0.000218388294387421, 0, 0)), .Names = c("S002ED2S5MID86",
"AcBarrieBulk10120130703", "PelisserRhizo30520130703", "S002F76S3MID96",
"S002ED0S3MID102"), row.names = c(1L, 2L, 3L, 4L, 5L, 4001L,
4002L, 4003L, 4004L, 4005L), class = "data.frame")
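Answer 0 (score: 1)
First make a dummy column that holds the pmax of all the other columns, i.e. the largest value in each row. Then filter on that column. After that you can delete the dummy column.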
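tData2_perc$filt <- do.call(pmax, tData2_perc)  # row-wise maximum across all columns
df <- tData2_perc[tData2_perc$filt >= 0.005, ]  # keep rows whose largest value is 0.005 or greater
tData2_perc$filt <- NULL  # delete the dummy columns
df$filt <- NULL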
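The same filter can also be written without a temporary column, by building a logical vector with rowSums. The following is a minimal sketch, not the answer's own method: the data frame `toy`, its values, and the names `threshold`, `keep` and `filtered` are made up for illustration, and it assumes all columns are numeric with no NA values.

```r
# Hypothetical toy data: 4 rows, 3 numeric columns of fractions.
toy <- data.frame(
  a = c(0.010, 0.001, 0.000, 0.004),
  b = c(0.000, 0.002, 0.000, 0.006),
  c = c(0.000, 0.003, 0.000, 0.000)
)
threshold <- 0.005

# toy >= threshold is a logical matrix; rowSums counts the TRUEs in each row.
# A row is kept when at least one of its values reaches the threshold.
keep <- rowSums(toy >= threshold) > 0

# drop = FALSE keeps the result a data frame even if only one column exists.
filtered <- toy[keep, , drop = FALSE]
print(filtered)
```

Here rows 2 and 3 are dropped because none of their values reaches 0.005, while rows 1 and 4 each have one qualifying value and are kept. Because the comparison and the row sums are vectorized, this avoids both the nested loops and the repeated rbind calls.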
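If you instead want to keep a row only when more than one column meets the threshold, do the following.
Create a dummy column that counts, for each row, the columns that meet (or fail) your condition. Then subset on the number of columns that must meet your specification.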
# Count, per row, the columns at or above the threshold (flip the comparison to count the failures instead).
tData2_perc$filt <- apply(tData2_perc, 1, function(x) sum(x >= 0.005))
# The 2 is an arbitrary example: keep rows with 2 or more columns that are 0.005 or greater. Change it to suit your needs.
df <- tData2_perc[tData2_perc$filt >= 2, ]
tData2_perc$filt <- NULL  # delete the dummy columns
df$filt <- NULL
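The counting variant can be wrapped in a small function so that the threshold and the required number of columns become parameters. This is only a sketch under assumptions: `filter_by_count`, its arguments, and the toy data are hypothetical names introduced here, and the input is assumed to be all numeric without NA values. It uses rowSums instead of apply, which gives the same counts.

```r
# Hypothetical helper: keep the rows in which at least `min_cols` values
# are greater than or equal to `threshold`.
filter_by_count <- function(df, threshold = 0.005, min_cols = 1) {
  n_pass <- rowSums(df >= threshold)  # qualifying columns per row
  df[n_pass >= min_cols, , drop = FALSE]
}

# Hypothetical toy data: 3 rows, 3 numeric columns.
toy <- data.frame(
  a = c(0.010, 0.001, 0.007),
  b = c(0.000, 0.002, 0.001),
  c = c(0.020, 0.003, 0.000)
)

at_least_one <- filter_by_count(toy, min_cols = 1)  # the original criterion
at_least_two <- filter_by_count(toy, min_cols = 2)  # the stricter criterion
print(at_least_one)
print(at_least_two)
```

With min_cols = 1 this reproduces the "at least one value at or above the threshold" rule from the question; raising min_cols tightens the filter without changing any other code.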