Averaging values in a table by identifying replicates from the column names

Time: 2013-05-21 11:40:26

Tags: r

I have the following table, named m:

Identifier  DAT_SN_e15.5_1  DAT_SN_e15.5_2  DAT_SN_p2_1 DAT_SN_p2_2
100009600   3           1           0           0
100009609   13          4           1           6
100009614   0           0           0           0
100009664   9           17          5           7
100012          0           0           0           0
100017          0           0           0           0
100019          1275            70          54          353
100033459   0           0           0           0
100034251   0           0           0           0
100034361   277         4           114         830

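Column 1 is the gene identifier, columns 2 and 3 are biological replicates of DAT_SN_e15.5, and columns 4 and 5 are biological replicates of DAT_SN_p2. My real-world data consists of 56 such samples, each with 2 replicates. Is there a way to identify the replicates from their names, given that the only difference is the 1 or 2 at the end of the name?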

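If so, how can I create a new table, m.rep, that averages the 2 values for each identifier and each sample, and that contains the gene identifier along with columns named DAT_SN_e15.5_ave and DAT_SN_p2_ave?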

1 answer:

Answer 0 (score: 0)

One idea is to use fuzzy search, that is, approximate pattern matching with agrep.

## replace nn with your own column names
nn <- c('DAT_SN_e15.5_1','DAT_SN_e15.5_2','DAT_SN_p2_1','DAT_SN_p2_2')
## for each column name, find the column names that approximately match it
ll <- lapply(seq_along(nn),function(x)
          nn[agrep(nn[x],nn)])
## remove duplicates, since a matches b and b matches a
ll[!duplicated(ll)]

[[1]]
[1] "DAT_SN_e15.5_1" "DAT_SN_e15.5_2"

[[2]]
[1] "DAT_SN_p2_1" "DAT_SN_p2_2"
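With 56 samples, approximate matching may also group columns from different samples whose names differ by only a character or two. As a hedged alternative (not part of the original answer), the sketch below assumes that replicate columns differ only by a trailing "_<number>" suffix, strips that suffix with sub, and groups the column names by what remains. The names sample_id and groups are illustrative.

```r
## Assumption: replicates of one sample differ only by a trailing "_<number>"
nn <- c('DAT_SN_e15.5_1','DAT_SN_e15.5_2','DAT_SN_p2_1','DAT_SN_p2_2')
## strip the trailing replicate number to recover the sample name
sample_id <- sub('_[0-9]+$', '', nn)
## keep the samples in their order of first appearance
sample_id <- factor(sample_id, levels = unique(sample_id))
## one list element per sample, holding its replicate column names
groups <- split(nn, sample_id)
groups
```

Because the grouping is exact rather than fuzzy, it does not depend on how similar two different sample names happen to be.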

Edit: here is how to apply the above to your data:
dat <- read.table(text='Identifier  DAT_SN_e15.5_1  DAT_SN_e15.5_2  DAT_SN_p2_1 DAT_SN_p2_2
100009600   3           1           0           0
100009609   13          4           1           6
100009614   0           0           0           0
100009664   9           17          5           7
100012          0           0           0           0
100017          0           0           0           0
100019          1275            70          54          353
100033459   0           0           0           0
100034251   0           0           0           0
100034361   277         4           114         830',header=TRUE)

nn <- colnames(dat)[-1]

ll <- lapply(seq_along(nn),function(x)
  nn[agrep(nn[x],nn)])
ll <- ll[!duplicated(ll)]

res <- lapply(ll,function(x)rowMeans(dat[,x]))
res <- t(do.call(rbind,res))
## use the first column name of each pair as the name of the averaged column
colnames(res) <- sapply(ll,'[[',1)


     DAT_SN_e15.5_1 DAT_SN_p2_1
 [1,]            2.0         0.0
 [2,]            8.5         3.5
 [3,]            0.0         0.0
 [4,]           13.0         6.0
 [5,]            0.0         0.0
 [6,]            0.0         0.0
 [7,]          672.5       203.5
 [8,]            0.0         0.0
 [9,]            0.0         0.0
[10,]          140.5       472.0
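The result above keeps the name of the first replicate and drops the Identifier column, whereas the question asks for a table m.rep with the identifier and columns ending in _ave. A minimal sketch of one way to get there follows. It assumes the replicate suffix is a trailing "_<number>" and uses only the first four rows of the question's data as a stand-in; sample_id and ave are illustrative names.

```r
## stand-in for the question's table (first four rows only)
dat <- data.frame(
  Identifier     = c(100009600, 100009609, 100009614, 100009664),
  DAT_SN_e15.5_1 = c(3, 13, 0, 9),
  DAT_SN_e15.5_2 = c(1, 4, 0, 17),
  DAT_SN_p2_1    = c(0, 1, 0, 5),
  DAT_SN_p2_2    = c(0, 6, 0, 7)
)

nn <- colnames(dat)[-1]
## Assumption: stripping a trailing "_<number>" yields the sample name
sample_id <- sub('_[0-9]+$', '', nn)
sample_id <- factor(sample_id, levels = unique(sample_id))

## one column of row means per sample, named after the sample
ave <- sapply(split(nn, sample_id),
              function(x) rowMeans(dat[, x, drop = FALSE]))
colnames(ave) <- paste0(colnames(ave), '_ave')

## put the gene identifier back in front of the averaged columns
m.rep <- data.frame(Identifier = dat$Identifier, ave, check.names = FALSE)
m.rep
```

For these four rows the averages agree with the first four rows of the output above (2.0, 8.5, 0.0, 13.0 and 0.0, 3.5, 0.0, 6.0).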