I have a data frame (df) that looks like the following (with many more columns and rows):
Cell_Cluster ARB2 DRAB2A FOXP2 ....
C18|O11.F2 2.234 0.315 3.325
C18|010.J2 0.215 1.215 -0.310
C18|S92.C1 -0.562 4.624 1.426
C20|O11.F2 1.150 -1.326 3.135
C20|S93.C2 -1.135 3.001 -2.932
C21|010.J2 2.125 1.250 0.013
.
.
.
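The columns after Cell_Cluster are all different genes. What I want to do is group by Cell_Cluster (more precisely, by all the characters before the "|") and then add, within each group, a column holding the mean of each gene. How can I do this?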
Answer 0 (score: 0)
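We assume the input data frame is DF, shown in reproducible form at the end of this answer.
Now suppose you want to add an extra column, mean, to the original data frame so that every row in a group carries the mean of all numeric columns in that group. Every row here has the same number of values, so the mean of all those numbers equals the mean of the group's rowMeans. We can therefore take rowMeans first and then average those row means within each group. For example, look at rows 4 and 5: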
# mean of all elements in rows 4 and 5
mean(c(1.15, -1.326, 3.135, -1.135, 3.001, -2.932))
## [1] 0.3155
# take mean of row 4 and then mean of row 5 and then mean of those 2 means
mean(c(mean(c(1.15, -1.326, 3.135)), mean(c(-1.135, 3.001, -2.932))))
## [1] 0.3155
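The shortcut above relies on every row contributing the same number of values. The following is a minimal sketch of that caveat, not part of the original answer; the missing value in `b_na` is a made-up case for illustration.

```r
# Rows 4 and 5 of the example, three values each
a <- c(1.15, -1.326, 3.135)
b <- c(-1.135, 3.001, -2.932)

# Equal counts per row: the mean of the row means equals the pooled mean
pooled   <- mean(c(a, b))
of_means <- mean(c(mean(a), mean(b)))
isTRUE(all.equal(pooled, of_means))
## [1] TRUE

# Hypothetical case: one value is missing, so the rows have unequal counts
b_na        <- c(-1.135, NA, -2.932)
pooled_na   <- mean(c(a, b_na), na.rm = TRUE)
of_means_na <- mean(c(mean(a), mean(b_na, na.rm = TRUE)))
isTRUE(all.equal(pooled_na, of_means_na))
## [1] FALSE
```

If the real data contains NA values, averaging the pooled values per group directly is the safer route.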
No packages are used.
transform(DF, mean = ave(rowMeans(DF[-1]), sub("\\|.*","",Cell_Cluster), FUN = mean))
giving:
Cell_Cluster ARB2 DRAB2A FOXP2 mean
1 C18|O11.F2 2.234 0.315 3.325 1.386889
2 C18|010.J2 0.215 1.215 -0.310 1.386889
3 C18|S92.C1 -0.562 4.624 1.426 1.386889
4 C20|O11.F2 1.150 -1.326 3.135 0.315500
5 C20|S93.C2 -1.135 3.001 -2.932 0.315500
6 C21|010.J2 2.125 1.250 0.013 1.129333
Note
Lines <- "
Cell_Cluster ARB2 DRAB2A FOXP2
C18|O11.F2 2.234 0.315 3.325
C18|010.J2 0.215 1.215 -0.310
C18|S92.C1 -0.562 4.624 1.426
C20|O11.F2 1.150 -1.326 3.135
C20|S93.C2 -1.135 3.001 -2.932
C21|010.J2 2.125 1.250 0.013"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)
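If the goal is instead one mean column per gene, as the question can also be read, the same ave() idea can be applied column by column. This is a sketch under that assumption; the `_mean` suffix and the names `grp`, `genes` and `DF_out` are hypothetical choices, not from the original answer.

```r
# Input rebuilt from the example so the block runs on its own
DF <- data.frame(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013),
  stringsAsFactors = FALSE
)

grp   <- sub("\\|.*", "", DF$Cell_Cluster)     # text before the first "|"
genes <- setdiff(names(DF), "Cell_Cluster")    # every remaining column is a gene

# ave() returns a vector as long as its input, holding each row's group mean
gene_means <- lapply(DF[genes], function(x) ave(x, grp, FUN = mean))
names(gene_means) <- paste0(genes, "_mean")    # hypothetical naming scheme

DF_out <- cbind(DF, as.data.frame(gene_means))
DF_out
```

Each `*_mean` column repeats the group mean of its gene on every row of that group, so the original rows and their order are preserved.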
Answer 1 (score: 0)
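If you want the mean over all genes in a group (rather than over a single column), it may help to reshape the data to long format first. You can do this with either the tidyr or the data.table package.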
tidyr approach
library(tidyverse)
gene <-
read_table("Cell_Cluster ARB2 DRAB2A FOXP2
C18|O11.F2 2.234 0.315 3.325
C18|010.J2 0.215 1.215 -0.310
C18|S92.C1 -0.562 4.624 1.426
C20|O11.F2 1.150 -1.326 3.135
C20|S93.C2 -1.135 3.001 -2.932
C21|010.J2 2.125 1.250 0.013")
gather(key, value) reshapes the data to long format. You can specify which columns to gather.
(gene1 <-
gene %>%
gather(-Cell_Cluster, key = key, value = value)) # gather except Cell_Cluster
#> # A tibble: 18 x 3
#> Cell_Cluster key value
#> <chr> <chr> <dbl>
#> 1 C18|O11.F2 ARB2 2.23
#> 2 C18|010.J2 ARB2 0.215
#> 3 C18|S92.C1 ARB2 -0.562
#> 4 C20|O11.F2 ARB2 1.15
#> 5 C20|S93.C2 ARB2 -1.14
#> 6 C21|010.J2 ARB2 2.12
#> 7 C18|O11.F2 DRAB2A 0.315
#> 8 C18|010.J2 DRAB2A 1.22
#> 9 C18|S92.C1 DRAB2A 4.62
#> 10 C20|O11.F2 DRAB2A -1.33
#> 11 C20|S93.C2 DRAB2A 3.00
#> 12 C21|010.J2 DRAB2A 1.25
#> 13 C18|O11.F2 FOXP2 3.32
#> 14 C18|010.J2 FOXP2 -0.31
#> 15 C18|S92.C1 FOXP2 1.43
#> 16 C20|O11.F2 FOXP2 3.14
#> 17 C20|S93.C2 FOXP2 -2.93
#> 18 C21|010.J2 FOXP2 0.013
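In current tidyr releases gather() is superseded by pivot_longer(). The sketch below shows what should be the equivalent call on a two-row stand-in for `gene` (the stand-in and the name `gene_long` are assumptions made so the block runs on its own).

```r
library(tidyr)

# Small stand-in for `gene`, rebuilt so the block runs on its own
gene <- data.frame(
  Cell_Cluster = c("C18|O11.F2", "C20|S93.C2"),
  ARB2   = c(2.234, -1.135),
  DRAB2A = c(0.315, 3.001),
  FOXP2  = c(3.325, -2.932),
  stringsAsFactors = FALSE
)

gene_long <- pivot_longer(
  gene,
  cols      = -Cell_Cluster,  # reshape every column except the identifier
  names_to  = "key",          # former column names are stored here
  values_to = "value"         # cell values are stored here
)
gene_long
```

Note that pivot_longer() orders the result row by row of the input, whereas gather() stacks one column after another, so the row order differs from the output shown above.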
Since you want to group by the part of Cell_Cluster before the | (if I understand correctly), you can split that column into two parts, separating on \\|.
gene1 %>%
separate(Cell_Cluster, into = c("cell", "cluster"),
sep = "\\|", remove = FALSE)
#> # A tibble: 18 x 5
#> Cell_Cluster cell cluster key value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 C18|O11.F2 C18 O11.F2 ARB2 2.23
#> 2 C18|010.J2 C18 010.J2 ARB2 0.215
#> 3 C18|S92.C1 C18 S92.C1 ARB2 -0.562
#> 4 C20|O11.F2 C20 O11.F2 ARB2 1.15
#> 5 C20|S93.C2 C20 S93.C2 ARB2 -1.14
#> 6 C21|010.J2 C21 010.J2 ARB2 2.12
#> 7 C18|O11.F2 C18 O11.F2 DRAB2A 0.315
#> 8 C18|010.J2 C18 010.J2 DRAB2A 1.22
#> 9 C18|S92.C1 C18 S92.C1 DRAB2A 4.62
#> 10 C20|O11.F2 C20 O11.F2 DRAB2A -1.33
#> 11 C20|S93.C2 C20 S93.C2 DRAB2A 3.00
#> 12 C21|010.J2 C21 010.J2 DRAB2A 1.25
#> 13 C18|O11.F2 C18 O11.F2 FOXP2 3.32
#> 14 C18|010.J2 C18 010.J2 FOXP2 -0.31
#> 15 C18|S92.C1 C18 S92.C1 FOXP2 1.43
#> 16 C20|O11.F2 C20 O11.F2 FOXP2 3.14
#> 17 C20|S93.C2 C20 S93.C2 FOXP2 -2.93
#> 18 C21|010.J2 C21 010.J2 FOXP2 0.013
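Why the separator is written as \\| can be seen in a small base R sketch, which also works when only the prefix is needed. The vector `x` and the names `cell` and `cluster` are illustrative, not from the original answer.

```r
x <- c("C18|O11.F2", "C18|010.J2", "C20|S93.C2")

# "|" means alternation in a regular expression, so it is escaped as "\\|"
cell <- sub("\\|.*", "", x)
cell
## [1] "C18" "C18" "C20"

# With fixed = TRUE the pattern is taken literally and needs no escaping
parts   <- strsplit(x, "|", fixed = TRUE)
cluster <- vapply(parts, `[`, character(1), 2)
cluster
## [1] "O11.F2" "010.J2" "S93.C2"
```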
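Now you can compute the mean of each group. Because you need an added column rather than a summary, use dplyr::mutate().
With spread(key, value) you can then return to the original wide format.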
gene %>%
gather(-Cell_Cluster, key = key, value = value) %>%
separate(Cell_Cluster, into = c("cell", "cluster"),
sep = "\\|", remove = FALSE) %>%
group_by(cell) %>% # group by cell column
mutate(M = mean(value)) %>% # make mean column
spread(key, value) %>%
ungroup() %>% # do not need cell and cluster column, so remove them
select(-cell, -cluster)
#> # A tibble: 6 x 5
#> Cell_Cluster M ARB2 DRAB2A FOXP2
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 C18|010.J2 1.39 0.215 1.22 -0.31
#> 2 C18|O11.F2 1.39 2.23 0.315 3.32
#> 3 C18|S92.C1 1.39 -0.562 4.62 1.43
#> 4 C20|O11.F2 0.315 1.15 -1.33 3.14
#> 5 C20|S93.C2 0.315 -1.14 3.00 -2.93
#> 6 C21|010.J2 1.13 2.12 1.25 0.013
The M column holds the mean computed over all genes within each group.
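data.table approach
Gene data can be large, so data.table may be a better fit for the implementation.
data.table::melt() instead of tidyr::gather(): see id.vars and variable.name
data.table::tstrsplit() instead of tidyr::separate(): to split on \\|, add perl = TRUE
data.table::dcast() instead of tidyr::spread(): see value.var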
All together:
gene %>%
data.table() %>%
melt(id.vars = "Cell_Cluster", variable.name = "key") %>% # gather
.[,
c("cell", "cluster") := tstrsplit(Cell_Cluster, split = "\\|", perl = TRUE)] %>% # split Cell_Cluster
.[,
M := mean(value), # average value column
by = cell] %>% # group by cell
dcast(Cell_Cluster + M ~ key, value.var = "value") # spread
#> Cell_Cluster M ARB2 DRAB2A FOXP2
#> 1: C18|010.J2 1.387 0.215 1.215 -0.310
#> 2: C18|O11.F2 1.387 2.234 0.315 3.325
#> 3: C18|S92.C1 1.387 -0.562 4.624 1.426
#> 4: C20|O11.F2 0.315 1.150 -1.326 3.135
#> 5: C20|S93.C2 0.315 -1.135 3.001 -2.932
#> 6: C21|010.J2 1.129 2.125 1.250 0.013
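For this particular result the reshape round trip is not strictly required. The following is a sketch of one alternative, assuming all non-identifier columns are numeric genes; the names `DT` and `genes` are illustrative.

```r
library(data.table)

# Input rebuilt from the example so the block runs on its own
DT <- data.table(
  Cell_Cluster = c("C18|O11.F2", "C18|010.J2", "C18|S92.C1",
                   "C20|O11.F2", "C20|S93.C2", "C21|010.J2"),
  ARB2   = c(2.234, 0.215, -0.562, 1.150, -1.135, 2.125),
  DRAB2A = c(0.315, 1.215, 4.624, -1.326, 3.001, 1.250),
  FOXP2  = c(3.325, -0.310, 1.426, 3.135, -2.932, 0.013)
)
genes <- setdiff(names(DT), "Cell_Cluster")   # gene columns, captured before adding helpers

DT[, cell := sub("\\|.*", "", Cell_Cluster)]  # temporary grouping key
# .SD holds the gene columns of the current group; unlist() pools their values
DT[, M := mean(unlist(.SD)), by = cell, .SDcols = genes]
DT[, cell := NULL]                            # drop the temporary key
DT[]
```

Unlike the dcast() version, this keeps the rows in their original order and modifies `DT` by reference.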
On this example, the data.table version is faster:
microbenchmark::microbenchmark(
DPLYR = {
gene %>%
gather(-Cell_Cluster, key = key, value = value) %>%
separate(Cell_Cluster, into = c("cell", "cluster"),
sep = "\\|", remove = FALSE) %>%
group_by(cell) %>%
mutate(M = mean(value)) %>%
spread(key, value) %>%
ungroup() %>%
select(-cell, -cluster)
},
DATATABLE = {
gene %>%
data.table() %>%
melt(id.vars = "Cell_Cluster", variable.name = "key") %>%
.[,
c("cell", "cluster") := tstrsplit(Cell_Cluster, split = "\\|", perl = TRUE)] %>%
.[,
M := mean(value),
by = cell] %>%
dcast(Cell_Cluster + M ~ key, value.var = "value")
},
times = 50
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> DPLYR 8.55 10.15 11.7 11.39 12.53 20.22 50
#> DATATABLE 3.39 3.94 4.8 4.77 5.46 7.69 50
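The timings above come from a six-row input, where fixed overhead is likely to dominate. To compare the two approaches on something closer to real gene data, a larger synthetic input could be generated along these lines. `make_gene_data()` and its parameters are hypothetical helpers, not part of either package.

```r
# Hypothetical helper: builds a data frame shaped like the example, with
# n_cells groups, n_per_cell rows per group and n_genes gene columns
make_gene_data <- function(n_cells = 100, n_per_cell = 50, n_genes = 200) {
  n_rows  <- n_cells * n_per_cell
  cell    <- rep(sprintf("C%d", seq_len(n_cells)), each = n_per_cell)
  cluster <- sprintf("S%d.X", seq_len(n_rows))
  values  <- matrix(rnorm(n_rows * n_genes), ncol = n_genes,
                    dimnames = list(NULL, sprintf("GENE%d", seq_len(n_genes))))
  data.frame(Cell_Cluster = paste(cell, cluster, sep = "|"), values,
             stringsAsFactors = FALSE)
}

set.seed(1)              # reproducible random values
big <- make_gene_data()
dim(big)
## [1] 5000  201
```

Substituting `big` for `gene` in the microbenchmark call would then give timings on a larger input; no such results are claimed here.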