Find duplicates, average them, and create a proper table

Time: 2014-05-13 12:43:54

Tags: r

Let's start with my data:

> dput(head(tbl_end))
structure(list(`Gene name` = c("at1g01050.1", "at1g01080.1", 
"at1g01090.1", "at1g01220.1", "at1g01320.2", "at1g01420.1"), 
    `1_1` = c(0, 0, 0, 0, 0, 0), `1_2` = c(0, 0, 0, 0, 0, 0), 
    `1_3` = c(0, 1, 0, 0, 0, 0), `1_4` = c(0, 0.660693687777888, 
    0, 0, 0, 0), `1_5` = c(0, 0.521435654491704, 0, 0, 0, 1), 
    `1_6` = c(0, 0.437291194705566, 0, 0, 0, 1), `1_7` = c(0, 
    0.52204783488213, 0, 0, 0, 0), `1_8` = c(0, 0.524298383907171, 
    0, 0, 0, 0), `1_9` = c(1, 0.376865096972469, 0, 1, 0, 0), 
    `1_10` = c(0, 0, 0, 0, 0, 0), `1_11` = c(0, 0, 0, 0, 0, 0
    ), `1_12` = c(0, 0, 0, 0, 0, 0), `1_13` = c(0, 0, 0, 0, 0, 
    0), `1_14` = c(0, 0, 0, 0, 0, 0), `1_15` = c(0, 0, 0, 0, 
    0, 0), `1_16` = c(0, 0, 0, 0, 0, 0), `1_17` = c(0, 0, 0, 
    0, 0, 0), `1_18` = c(0, 0, 0.476101907006443, 0, 0, 0), `1_19` = c(0, 
    0, 1, 0, 0, 0), `1_20` = c(0, 0, 0, 0, 0, 0), `1_21` = c(0, 
    0, 0, 0, 1, 0), `1_22` = c(0, 0, 0, 0, 0, 0), `1_23` = c(0, 
    0, 0, 0, 0, 0), `1_24` = c(0, 0, 0, 0, 0, 0)), .Names = c("Gene name", 
"1_1", "1_2", "1_3", "1_4", "1_5", "1_6", "1_7", "1_8", "1_9", 
"1_10", "1_11", "1_12", "1_13", "1_14", "1_15", "1_16", "1_17", 
"1_18", "1_19", "1_20", "1_21", "1_22", "1_23", "1_24"), row.names = c(NA, 
6L), class = "data.frame")

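So I have more than 2k rows. I set the gene names as the row names, but there is a problem. Sometimes the same gene has different "models" (so a dot and a number such as 1 or 2 is appended to the name), but it is still the same gene. I would like to find all of these duplicates (rows with the same gene name), average the values in each column across them, and keep only one row per gene holding the averages.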

Is this possible?

Here are just some of my gene names:

> dput(vec_names)
c("at1g01050.1", "at1g01080.1", "at1g01090.1", "at1g01220.1", 
"at1g01320.2", "at1g01420.1", "at1g01470.1", "at1g01800.1", "at1g01910.5", 
"at1g01920.2", "at1g01960.1", "at1g01980.1", "at1g02020.2", "at1g02100.2", 
"at1g02130.1", "at1g02140.1", "at1g02150.1", "at1g02305.1", "at1g02500.2", 
"at1g02560.1", "at1g02780.1", "at1g02880.3", "at1g02920.1", "at1g02930.2", 
"at1g03030.1", "at1g03090.2", "at1g03110.1", "at1g03130.1", "at1g03210.1", 
"at1g03220.1", "at1g03230.1", "at1g03310.2", "at1g03330.1", "at1g03475.1", 
"at1g03630.2", "at1g03680.1", "at1g03870.1", "at1g03900.1", "at1g04080.2", 
"at1g04130.1", "at1g04170.1", "at1g04190.1", "at1g04270.2", "at1g04350.1", 
"at1g04410.1", "at1g04420.1", "at1g04530.1", "at1g04640.2", "at1g04690.1", 
"at1g04750.2", "at1g04810.1", "at1g04850.1", "at1g04870.2", "at1g05010.1", 
"at1g05180.1", "at1g05190.1", "at1g05320.3", "at1g05350.1", "at1g05520.1", 
"at1g05560.1", "at1g05620.2", "at1g06000.1", "at1g06110.1", "at1g06130.2", 
"at1g06290.1", "at1g06410.1", "at1g06550.1", "at1g06560.1", "at1g06570.1", 

I think there is a function for this, but I can't find it.

1 answer:

Answer 0: (score: 4)

Using data.table:

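library(data.table)
# dat is the data frame from the question (tbl_end there)
dt <- as.data.table(dat)
# Strip the model suffix (the first dot and everything after it) to get the gene ID
dt[, gene_unique := sub("[.].*$", "", `Gene name`)]
# Columns 2 to 25 are the 24 numeric columns 1_1 ... 1_24
cols <- colnames(dt)[2:25]
# Average every numeric column within each gene ID
dt[, lapply(.SD, mean), by = gene_unique, .SDcols = cols]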

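Using aggregate, as suggested in the comments:

# Replace each name with its gene ID by dropping the model suffix
dat$`Gene name` <- sub("[.].*$", "", dat$`Gene name`)
# Average all remaining columns within each gene ID
aggregate(. ~ `Gene name`, dat, mean)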
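
To make the grouping step easy to check, here is a minimal base R sketch on made-up data. The `toy` data frame and its values are invented for illustration and are not taken from the question. It strips the suffix with the pattern `[.].*$`, which matches the first dot and everything after it (a pattern such as `[.]*` would only delete the dot itself and leave the model number attached), and it passes the grouping key through the `by` argument of `aggregate` instead of a formula.

```r
# Toy data (made-up values): two models of at1g01320 and one single-model gene
toy <- data.frame(
  `Gene name` = c("at1g01320.1", "at1g01320.2", "at1g01420.1"),
  `1_1` = c(0, 1, 0.5),
  `1_2` = c(0.2, 0.4, 1),
  check.names = FALSE
)

# Drop the first dot and everything after it to get the gene ID
gene <- sub("[.].*$", "", toy$`Gene name`)

# Average every numeric column within each gene ID;
# the first column (the original names) is left out of the data being averaged
avg <- aggregate(toy[-1], by = list(gene = gene), FUN = mean)
print(avg)
```

The two at1g01320 rows collapse into a single row holding their column means, while at1g01420 keeps its original values because it has only one model.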