Merge rows that share the same ID and take their mean

Time: 2015-09-02 08:17:36

Tags: r

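In the table below I need to combine the rows that share the same ID (column 2) by computing the mean of those rows. I was thinking of a plyr function?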

ddply(df, summarize, value = average(ID))

DF:

      miRNA      ID  100G  100R 106G  106R  122G  122R 124G  124R  126G 126R  134G  134R 141G  141R 167G 167R 185G  185R
1   hsa-miR-106a ID7 1585   423  180   113   598   266  227   242    70  106  2703   442  715   309  546  113  358   309
2 hsa-miR-1185-1 ID2   10     1    3     3    11     8    4     4    28    2    13     3    6     3    6    4    7     5
3 hsa-miR-1185-2 ID2    2     0    2     1     5     1    1     0     4    1     1     1    3     2    2    0    2     1
4   hsa-miR-1197 ID2    2     0    0     5     3     3    0     4    16    0     4     1    3     0    0    2    2     4
5    hsa-miR-127 ID3   29    17    6    55    40    35    6    20   171   10    32    21   23    25   10   14   32    55

Summary of the original data:

> str(ClusterMatrix)
'data.frame':   113 obs. of  98 variables:
 $ miRNA: Factor w/ 202 levels "hsa-miR-106a",..: 1 3 4 6 8 8 14 15 15 16 ...
 $ ID   : Factor w/ 27 levels "ID1","ID10","ID11",..: 25 12 12 12 21 21 12 21 21 6 ...
 $ 100G : Factor w/ 308 levels "-0.307749042739963",..: 279 11 3 3 101 42 139 158 215 222 ...
 $ 100R : Factor w/ 316 levels "-0.138028803567403",..: 207 7 8 8 18 42 128 183 232 209 ...
 $ 106G : Factor w/ 260 levels "-0.103556709881933",..: 171 4 1 3 7 258 95 110 149 162 ...
 $ 106R : Factor w/ 300 levels "-0.141810346640204",..: 141 4 6 2 108 41 146 196 244 267 ...
 $ 122G : Factor w/ 336 levels "-0.0409548922061764",..: 237 12 4 6 103 47 148 203 257 264 ...
 $ 122R : Factor w/ 316 levels "-0.135708706475279",..: 177 1 8 6 36 44 131 192 239 244 ...
 $ 124G : Factor w/ 267 levels "-0.348439853247856",..: 210 5 2 3 7 50 126 138 188 249 ...
 $ 124R : Factor w/ 303 levels "-0.176414190219115",..: 193 3 7 3 21 52 167 200 238 239 ...
 $ 126G : Factor w/ 307 levels "-0.227658806811544",..: 122 88 5 76 169 61 240 220 281 265 ...
 $ 126R : Factor w/ 249 levels "-0.271925865853123",..: 119 1 2 3 11 247 78 110 151 193 ...
 $ 134G : Factor w/ 344 levels "-0.106333543799583",..: 304 14 8 5 33 48 150 196 248 231 ...
 $ 134R : Factor w/ 300 levels "-0.0997616469801097",..: 183 5 7 7 22 298 113 159 213 221 ...
 $ 141G : Factor w/ 335 levels "-0.134429748398679",..: 253 7 3 3 24 29 142 137 223 302 ...
 $ 141R : Factor w/ 314 levels "-0.143299688877927",..: 210 4 5 7 98 54 154 199 255 251 ...
 $ 167G : Factor w/ 306 levels "-0.211181452126958",..: 222 7 4 6 11 292 91 101 175 226 ...
 $ 167R : Factor w/ 282 levels "-0.0490740880560127",..: 130 2 6 4 15 282 110 146 196 197 ...
 $ 185G : Factor w/ 317 levels "-0.0567841338235346",..: 218 2 7 7 33 34 130 194 227 259 ...

1 answer:

Answer 0 (score: 1)

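We can use dplyr. We group by 'ID' and use mutate_each to create columns holding the mean of each of the columns '100G' to '185R'. The columns are selected inside mutate_each with matches and a regular-expression pattern. We then cbind (bind_cols) the original dataset with the mutated columns, converting to a data.frame if needed. We can also rename the mean columns.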

library(dplyr)
out <- df1 %>%
        group_by(ID) %>% 
        mutate_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+')) %>%
        setNames(., c(names(.)[1:2], paste0('Mean_', names(.)[3:ncol(.)]))) %>%
        as.data.frame()

out1 <- bind_cols(df1, out[-(1:2)])
out1
#           miRNA  ID 100G 100R 106G 106R 122G 122R 124G 124R 126G 126R 134G
#1   hsa-miR-106a ID7 1585  423  180  113  598  266  227  242   70  106 2703
#2 hsa-miR-1185-1 ID2   10    1    3    3   11    8    4    4   28    2   13
#3 hsa-miR-1185-2 ID2    2    0    2    1    5    1    1    0    4    1    1
#4   hsa-miR-1197 ID2    2    0    0    5    3    3    0    4   16    0    4
#5    hsa-miR-127 ID3   29   17    6   55   40   35    6   20  171   10   32
#  134R 141G 141R 167G 167R 185G 185R   Mean_100G   Mean_100R  Mean_106G
#1  442  715  309  546  113  358  309 1585.000000 423.0000000 180.000000
#2    3    6    3    6    4    7    5    4.666667   0.3333333   1.666667
#3    1    3    2    2    0    2    1    4.666667   0.3333333   1.666667
#4    1    3    0    0    2    2    4    4.666667   0.3333333   1.666667
#5   21   23   25   10   14   32   55   29.000000  17.0000000   6.000000
#  Mean_106R  Mean_122G Mean_122R  Mean_124G  Mean_124R Mean_126G Mean_126R
#1       113 598.000000       266 227.000000 242.000000        70       106
#2         3   6.333333         4   1.666667   2.666667        16         1
#3         3   6.333333         4   1.666667   2.666667        16         1
#4         3   6.333333         4   1.666667   2.666667        16         1
#5        55  40.000000        35   6.000000  20.000000       171        10
#  Mean_134G  Mean_134R Mean_141G  Mean_141R  Mean_167G Mean_167R  Mean_185G
#1      2703 442.000000       715 309.000000 546.000000       113 358.000000
#2         6   1.666667         4   1.666667   2.666667         2   3.666667
#3         6   1.666667         4   1.666667   2.666667         2   3.666667
#4         6   1.666667         4   1.666667   2.666667         2   3.666667
#5        32  21.000000        23  25.000000  10.000000        14  32.000000
#   Mean_185R
#1 309.000000
#2   3.333333
#3   3.333333
#4   3.333333
#5  55.000000
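
For readers on a newer dplyr (1.0.0 or later), where mutate_each and funs are deprecated, the same idea can be sketched with across(). This is not the code from the answer above, only a rough equivalent. The data frame df_small and the result name out_across are made up for illustration and hold just two of the numeric columns.

```r
library(dplyr)

# Hypothetical small subset of the example data (two numeric columns only).
df_small <- data.frame(
  miRNA = c("hsa-miR-106a", "hsa-miR-1185-1", "hsa-miR-1185-2",
            "hsa-miR-1197", "hsa-miR-127"),
  ID = c("ID7", "ID2", "ID2", "ID2", "ID3"),
  `100G` = c(1585, 10, 2, 2, 29),
  `100R` = c(423, 1, 0, 0, 17),
  check.names = FALSE,       # keep the non-syntactic names "100G", "100R"
  stringsAsFactors = FALSE
)

out_across <- df_small %>%
  group_by(ID) %>%
  # Add one Mean_<column> per column whose name starts with a digit;
  # the original columns are kept, so no bind_cols step is needed.
  mutate(across(matches("^\\d+"), ~ mean(.x, na.rm = TRUE),
                .names = "Mean_{.col}")) %>%
  ungroup() %>%
  as.data.frame()

out_across
```

Because .names writes the means into new columns, the original values stay next to the per-ID means in a single step.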

Edit: If we need a single row of means for each 'ID', we can use summarise_each

df1 %>%
  group_by(ID) %>%
  summarise_each(funs(mean=mean(., na.rm=TRUE)), matches('^\\d+'))
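
On dplyr 1.0.0 or later, a rough equivalent of the one-row-per-ID summary can be written with summarise() and across(). The sketch below assumes a made-up two-column subset of the data (df_small); the name means_by_id is also just illustrative.

```r
library(dplyr)

# Hypothetical small subset of the example data (two numeric columns only).
df_small <- data.frame(
  miRNA = c("hsa-miR-106a", "hsa-miR-1185-1", "hsa-miR-1185-2",
            "hsa-miR-1197", "hsa-miR-127"),
  ID = c("ID7", "ID2", "ID2", "ID2", "ID3"),
  `100G` = c(1585, 10, 2, 2, 29),
  `100R` = c(423, 1, 0, 0, 17),
  check.names = FALSE,       # keep the non-syntactic names "100G", "100R"
  stringsAsFactors = FALSE
)

means_by_id <- df_small %>%
  group_by(ID) %>%
  # One row per ID; only columns whose names start with a digit are averaged,
  # so the miRNA column is dropped from the result.
  summarise(across(matches("^\\d+"), ~ mean(.x, na.rm = TRUE))) %>%
  as.data.frame()

means_by_id
```

Note that the miRNA names do not survive the summary, since several miRNAs collapse into one row per ID.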

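EDIT2: Based on the OP's update, the columns of the original dataset ('ClusterMatrix') are all of class factor. We need to convert the columns to numeric before taking the mean. There are two ways to convert a factor to numeric: 1) as.numeric(as.character(.., which may be slower, and 2) as.numeric(levels(.., which is faster. Here I use the first method because it may be clearer.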

ClusterMatrix %>% 
      group_by(ID) %>% 
      summarise_each(funs(mean= mean(as.numeric(as.character(.)), 
            na.rm=TRUE)), matches('^\\d+'))
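
To see why the conversion matters, here is a minimal sketch on a made-up factor f (not taken from ClusterMatrix). Calling as.numeric directly on a factor returns the internal level codes rather than the values, which is why one of the two conversions is needed. The indexing form as.numeric(levels(f))[f] shown here is the idiom documented in ?factor; whether it is the exact expression the answer had in mind is an assumption.

```r
# Hypothetical factor whose levels are numbers stored as text.
f <- factor(c("10", "2.5", "10", "-0.3"))

# Pitfall: this gives the integer level codes, not the numeric values.
codes <- as.integer(f)

# Method 1: go through character, converting every element.
via_char <- as.numeric(as.character(f))

# Method 2: convert only the unique levels, then index by the factor.
via_levels <- as.numeric(levels(f))[f]

via_char
via_levels
```

Method 2 converts each distinct level once, which is where the speed difference comes from on long vectors with few distinct values.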

Data

df1 <- structure(list(miRNA = c("hsa-miR-106a", "hsa-miR-1185-1",
"hsa-miR-1185-2", 
"hsa-miR-1197", "hsa-miR-127"), ID = c("ID7", "ID2", "ID2", "ID2", 
"ID3"), `100G` = c(1585L, 10L, 2L, 2L, 29L), `100R` = c(423L, 
1L, 0L, 0L, 17L), `106G` = c(180L, 3L, 2L, 0L, 6L), `106R` = c(113L, 
3L, 1L, 5L, 55L), `122G` = c(598L, 11L, 5L, 3L, 40L), `122R` = c(266L, 
8L, 1L, 3L, 35L), `124G` = c(227L, 4L, 1L, 0L, 6L), `124R` = c(242L, 
4L, 0L, 4L, 20L), `126G` = c(70L, 28L, 4L, 16L, 171L), `126R` = c(106L, 
2L, 1L, 0L, 10L), `134G` = c(2703L, 13L, 1L, 4L, 32L), `134R` = c(442L, 
3L, 1L, 1L, 21L), `141G` = c(715L, 6L, 3L, 3L, 23L), `141R` = c(309L, 
3L, 2L, 0L, 25L), `167G` = c(546L, 6L, 2L, 0L, 10L), `167R` = c(113L, 
4L, 0L, 2L, 14L), `185G` = c(358L, 7L, 2L, 2L, 32L), `185R` = c(309L, 
5L, 1L, 4L, 55L)), .Names = c("miRNA", "ID", "100G", "100R", 
"106G", "106R", "122G", "122R", "124G", "124R", "126G", "126R", 
"134G", "134R", "141G", "141R", "167G", "167R", "185G", "185R"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"
))