我正在使用R中的dplyr
包来过滤我的基因表达数据。我已经计算出倍数变化,并希望过滤至少一个样本(列)的值大于+0.584963或小于-0.584963的基因(行)。示例数据:
X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1 0.6780 0.4050 0.8870 0.3300 0.2230
GENE_2 0.2340 -0.6670 0.0020 0.1240 0.3560
GENE_3 0.0170 0.1560 0.1120 0.0080 -0.1230
GENE_4 -0.0944 -0.1372 -0.1800 -0.2228 -0.2656
GENE_5 -0.8080 -0.7800 -0.5560 0.0340 0.4450
GENE_6 0.2091 0.1106 0.0121 -0.0864 -0.1849
GENE_7 0.5980 0.7680 0.9970 0.4670 -0.7760
我当前正在使用以下脚本
det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("`",det.cols,"`",">abs(0.584963)", sep = "")))
my.datasub<- my.data %>% filter_(filt)
但是这仅返回大于+0.584963的基因,而不返回负基因。在该示例的情况下,我想要的是具有基因1、2、5和7的子集列表。但是相反,它仅给我基因1和7。我该如何更改?
我希望答案采用以下格式:
X SAMPLE_1_FC SAMPLE_2_FC SAMPLE_3_FC SAMPLE_4_FC SAMPLE_5_FC
GENE_1 0.6780 0.4050 0.8870 0.3300 0.2230
GENE_2 0.2340 -0.6670 0.0020 0.1240 0.3560
GENE_5 -0.8080 -0.7800 -0.5560 0.0340 0.4450
GENE_7 0.5980 0.7680 0.9970 0.4670 -0.7760
谢谢。
答案 0 :(得分:1)
使用filter_at
中的dplyr
可能是一种更加灵活的方法...
# set up sample data with 50000 rows [as proposed by Arthur Yip above]
mydata <- tibble(X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))
# duplicate 30 more columns [as proposed by Arthur Yip above]
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])
mydata2 %>%
filter_at(vars(contains("fc")), .vars_predicate = any_vars(abs(.) > 0.584963))
在vars()
中,您可以定义要对其应用过滤的变量列表。在.vars_predicate
之后,您可以定义过滤条件(any_vars
等于|
,all_vars
等于&
)。
答案 1 :(得分:1)
长话短说,您在代码中的错误位置放置了abs()
。
我在这里固定了它
det.cols<- colnames(my.data)[which(grepl("fc",tolower(colnames(my.data))))]
filt <- gsub(","," | ",toString(paste("abs(`",det.cols,"`)",">0.584963", sep = "")))
my.datasub<- my.data %>% filter_(filt)
为了获得更大的灵活性,@ ha_pu提供了一个很棒的filter_at
解决方案,该解决方案以我之前的解决方案为基础(在我确定您的代码中的错误之前)。
答案 2 :(得分:0)
这是一个可以灵活处理样本和数据行数的解决方案。它涉及将数据转换为长格式,然后过滤基因和特定样本。我在50k个基因和35个样本上对其进行了测试,并且运行了<1秒。
library(tidyverse)
# set up sample data with 50000 rows
mydata <- data.frame(stringsAsFactors=FALSE,
X = c("GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5", "GENE_6", "GENE_7", 1:50000),
SAMPLE_1_FC = c(0.678, 0.234, 0.017, -0.0944, -0.808, 0.2091, 0.598, rnorm(50000, 0, 1)),
SAMPLE_2_FC = c(0.405, -0.667, 0.156, -0.1372, -0.78, 0.1106, 0.768, rnorm(50000, 0, 1)),
SAMPLE_3_FC = c(0.887, 0.002, 0.112, -0.18, -0.556, 0.0121, 0.997, rnorm(50000, 0, 1)),
SAMPLE_4_FC = c(0.33, 0.124, 0.008, -0.2228, 0.034, -0.0864, 0.467, rnorm(50000, 0, 1)),
SAMPLE_5_FC = c(0.223, 0.356, -0.123, -0.2656, 0.445, -0.1849, -0.776, rnorm(50000, 0, 1)))
# duplicate 30 more columns
mydata2 <- bind_cols(mydata, mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6], mydata[2:6])
(mydata3 <- mydata2 %>% gather(key = "sample_num", value = "fc", 2:length(mydata)) %>%
filter(fc > 0.584963 | fc < -0.584963) %>%
select(X) %>%
arrange(desc(X)) %>%
unique() %>%
head())
#> X
#> 1 GENE_7
#> 5 GENE_5
#> 7 GENE_2
#> 8 GENE_1
#> 10 9999
#> 14 9998
由reprex package(v0.2.1)于2019-03-01创建