I have the following table, named m:
Identifier DAT_SN_e15.5_1 DAT_SN_e15.5_2 DAT_SN_p2_1 DAT_SN_p2_2
100009600 3 1 0 0
100009609 13 4 1 6
100009614 0 0 0 0
100009664 9 17 5 7
100012 0 0 0 0
100017 0 0 0 0
100019 1275 70 54 353
100033459 0 0 0 0
100034251 0 0 0 0
100034361 277 4 114 830
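Column 1 is the gene identifier, columns 2 and 3 are biological replicates of DAT_SN_e15.5, and columns 4 and 5 are biological replicates of DAT_SN_p2. My real data consists of 56 such samples, each with 2 replicates. Is there a way to identify the replicates by name, given that the only difference is the 1 or 2 at the end of the name?
If so, how can I create a new table m.rep that averages the 2 values for each identifier and each sample, and that contains the gene identifier plus columns named DAT_SN_e15.5_ave and DAT_SN_p2_ave?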
Answer 0 (score: 0)
One idea is to use a fuzzy search, that is, approximate pattern matching with agrep.
## replace nn with your own column names
nn <- c('DAT_SN_e15.5_1','DAT_SN_e15.5_2','DAT_SN_p2_1','DAT_SN_p2_2')
## for each column name, find the column names that approximately match it
ll <- lapply(seq_along(nn),function(x)
nn[agrep(nn[x],nn)])
## remove duplicates, since a is similar to b and b is similar to a
ll[!duplicated(ll)]
[[1]]
[1] "DAT_SN_e15.5_1" "DAT_SN_e15.5_2"
[[2]]
[1] "DAT_SN_p2_1" "DAT_SN_p2_2"
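Because agrep tolerates small differences by design, sample names that differ from each other by only a character or two may end up in the same group when there are many samples. As a stricter alternative, here is a minimal sketch that assumes every replicate column ends in an underscore followed by a replicate number; the names sample_id and groups are illustrative only.

```r
nn <- c('DAT_SN_e15.5_1','DAT_SN_e15.5_2','DAT_SN_p2_1','DAT_SN_p2_2')
## strip the trailing _<digits> to recover the sample name (assumed naming scheme)
sample_id <- sub('_[0-9]+$', '', nn)
## keep the samples in their original column order
sample_id <- factor(sample_id, levels = unique(sample_id))
## group the column names by sample
groups <- split(nn, sample_id)
groups
```

Each element of groups then holds the replicate columns of one sample, and the element names are the sample names without the replicate suffix.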
EDIT: here is how to apply the agrep grouping above to your data.
dat <- read.table(text='Identifier DAT_SN_e15.5_1 DAT_SN_e15.5_2 DAT_SN_p2_1 DAT_SN_p2_2
100009600 3 1 0 0
100009609 13 4 1 6
100009614 0 0 0 0
100009664 9 17 5 7
100012 0 0 0 0
100017 0 0 0 0
100019 1275 70 54 353
100033459 0 0 0 0
100034251 0 0 0 0
100034361 277 4 114 830',header=TRUE)
nn <- colnames(dat)[-1]
ll <- lapply(seq_along(nn),function(x)
nn[agrep(nn[x],nn)])
ll <- ll[!duplicated(ll)]
res <- lapply(ll,function(x)rowMeans(dat[,x]))
res <- t(do.call(rbind,res))
## use the first element of each pair as the column name
colnames(res) <- sapply(ll,'[[',1)
res
DAT_SN_e15.5_1 DAT_SN_p2_1
[1,] 2.0 0.0
[2,] 8.5 3.5
[3,] 0.0 0.0
[4,] 13.0 6.0
[5,] 0.0 0.0
[6,] 0.0 0.0
[7,] 672.5 203.5
[8,] 0.0 0.0
[9,] 0.0 0.0
[10,] 140.5 472.0
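The result above is a plain matrix without the Identifier column, and its columns are named after the first replicate. To get closer to the m.rep table asked for in the question, the following sketch groups the columns by stripping the replicate suffix instead of using agrep. It assumes every replicate column ends in _1 or _2, and it uses only four rows of the question's data to stay short; the helper names sample_id and ave are illustrative.

```r
## a few rows taken from the question's table
dat <- read.table(text='Identifier DAT_SN_e15.5_1 DAT_SN_e15.5_2 DAT_SN_p2_1 DAT_SN_p2_2
100009600 3 1 0 0
100009609 13 4 1 6
100019 1275 70 54 353
100034361 277 4 114 830', header=TRUE)

nn <- colnames(dat)[-1]
## sample name = column name without the trailing replicate number
sample_id <- sub('_[0-9]+$', '', nn)
sample_id <- factor(sample_id, levels = unique(sample_id))
## average the replicate columns of each sample, row by row
ave <- sapply(split(nn, sample_id),
              function(cols) rowMeans(dat[, cols, drop = FALSE]))
colnames(ave) <- paste0(colnames(ave), '_ave')
## put the gene identifier back in front of the averaged columns
m.rep <- data.frame(Identifier = dat$Identifier, ave, check.names = FALSE)
m.rep
```

With 56 samples the same code should apply unchanged, provided the naming scheme holds for every column.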