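I'm struggling to come up with a working solution to what seems like a fairly simple problem. I have a data frame containing data and factors, and I want to use the factors to decide which data points get subtracted from which others, producing a new data frame of comparison values.
Here is what the data frame looks like: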
str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of 5 variables:
$ rat : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
$ gene : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
$ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
$ timepoint1 : num 23.4 18.3 42.1 40.1 25.3 ...
$ timepoint2 : num 23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]
rat gene gene_category timepoint1 timepoint2
(fctr) (fctr) (fctr) (dbl) (dbl)
1 Rat1 gene1 experimental 23.36667 23.49667
2 Rat1 gene2 experimental 18.26000 18.38000
3 Rat1 gene3 control 42.05500 41.45000
4 Rat1 gene4 control 40.08667 39.89500
5 Rat2 gene1 experimental 25.29333 22.83000
6 Rat2 gene2 experimental 19.72667 19.19333
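For each rat (8 rats in total), I want to subtract the control gene values (genes 3 and 4) from the experimental gene values (genes 1 and 2). I need to do this iteratively, so that each experimental gene value has each control gene value subtracted from it (within each rat, but not across rats). This should be done for each timepoint column.
I have been fiddling with a solution using dplyr. I have the grouping done, but I can't figure out how to do the rest: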
diffs <- means %>% group_by(rat, gene, gene_category) %>% here_is_where_i_don't_know_what_to_do)
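There is a solution to a similar problem here, but I think it would give me every possible pairwise operation, which is not what I want. It also only involves two factors, whereas I need to account for three.
Here's another solution to a similar problem, but a few things make it less than ideal. It handles only one factor, and I am not sure how it would apply to a data set with three factors and two data vectors.
I know this problem gets solved whenever pairwise comparisons are run to determine statistical significance (multiple t-tests, ANOVA, MANOVA, etc.), but the packages and base statistical functions I am familiar with for those tests keep this basic operation under the hood. I would like a simple solution using base R or dplyr/plyr/reshape2, with as few loops as possible.
Answer 0 (score: 3)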
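I think the solution will involve generating the comparisons you want and then passing them to the standard-evaluation mutate_, rather than fighting with group_by and summarize.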
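First, here is the data read in (note that I added genes 3 and 4 for Rat2):
library(dplyr)  # %>%, select, mutate_
library(tidyr)  # gather, spread, separate
means <-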
read.table(text =
" rat gene gene_category timepoint1 timepoint2
1 Rat1 gene1 experimental 23.36667 23.49667
2 Rat1 gene2 experimental 18.26000 18.38000
3 Rat1 gene3 control 42.05500 41.45000
4 Rat1 gene4 control 40.08667 39.89500
5 Rat2 gene1 experimental 25.29333 22.83000
6 Rat2 gene2 experimental 19.72667 19.19333
7 Rat2 gene3 control 42.05500 41.45000
8 Rat2 gene4 control 40.08667 39.89500")
Next, generate the set of genes in each category:
geneLists <-
means %>%
{split(.$gene, .$`gene_category`)} %>%
lapply(unique) %>%
lapply(as.character) %>%
lapply(function(x){paste0("`", x, "`")})
Note that the backticks (`) guard against column names that would otherwise be invalid (e.g., names containing spaces). This gives:
$control
[1] "`gene3`" "`gene4`"
$experimental
[1] "`gene1`" "`gene2`"
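To see why the backticks matter, here is a small sketch in base R. The data frame df and its column names "gene 1" and "gene 3" are made up for illustration; they are not part of the question's data. Without backticks, a name containing a space cannot be parsed as a single symbol.

```r
# Hypothetical data frame with non-syntactic column names (spaces).
df <- data.frame(`gene 1` = c(5, 7), `gene 3` = c(2, 3), check.names = FALSE)

# Quoting each name in backticks lets the expression parse and evaluate.
quoted <- "`gene 1` - `gene 3`"
difference <- eval(parse(text = quoted), df)
print(difference)

# The unquoted version is a syntax error, so parse() fails.
unquoted <- "gene 1 - gene 3"
parsedOk <- !inherits(try(parse(text = unquoted), silent = TRUE), "try-error")
print(parsedOk)
```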
Then, paste together the desired comparisons:
colsToCreate <-
outer(geneLists[["experimental"]]
, geneLists[["control"]]
, paste, sep = " - ") %>%
as.character()
which gives:
[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"
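It may help to look at what outer returns before it is flattened. In this sketch the two gene vectors are typed in by hand rather than taken from geneLists, and the dimnames are added only for display. outer builds a matrix with one row per experimental gene and one column per control gene, and as.character then reads it off column by column, which explains the order shown above.

```r
# Hand-typed stand-ins for geneLists[["experimental"]] and geneLists[["control"]].
experimental <- c("`gene1`", "`gene2`")
control <- c("`gene3`", "`gene4`")

# Every experimental gene is paired with every control gene.
comparisons <- outer(experimental, control, paste, sep = " - ")
dimnames(comparisons) <- list(experimental, control)  # labels for display only
print(comparisons)

# Flattening is column-major: all pairs with gene3 come first, then gene4.
flat <- as.character(comparisons)
print(flat)
```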
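Then, use tidyr to spread the data so that there is one row per rat. Note that if you want to spread both timepoint1 and timepoint2, you will probably need to gather first (putting both times in one column), then create an id column containing both the time and the gene, and then spread using that single id column. That would also require changing how colsToCreate is constructed.
After spreading, pass in the vector of columns to create, and you should have what you want: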
means %>%
select(rat, gene, timepoint1) %>%
spread(gene, timepoint1) %>%
mutate_(.dots = colsToCreate)
Voila:
rat gene1 gene2 gene3 gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
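The underscored verbs such as mutate_ were deprecated in later dplyr releases (0.7.0 onward). As a hedged sketch of the same step with current tooling, the strings can be parsed with rlang::parse_expr and spliced into mutate with !!!, and pivot_wider can stand in for spread. This swaps in a different mechanism from the one used above; the names newCols and wide are mine, and the data is cut down to Rat1 to keep the example short.

```r
library(dplyr)
library(tidyr)
library(rlang)

# Rat1 only, copied from the data above, to keep the sketch self-contained.
means <- read.table(header = TRUE, text = "
rat gene gene_category timepoint1 timepoint2
Rat1 gene1 experimental 23.36667 23.49667
Rat1 gene2 experimental 18.26000 18.38000
Rat1 gene3 control 42.05500 41.45000
Rat1 gene4 control 40.08667 39.89500
")

colsToCreate <- c("`gene1` - `gene3`", "`gene2` - `gene3`",
                  "`gene1` - `gene4`", "`gene2` - `gene4`")

# Turn each string into an unevaluated expression; name it without backticks.
newCols <- lapply(colsToCreate, parse_expr)
names(newCols) <- gsub("`", "", colsToCreate, fixed = TRUE)

wide <- means %>%
  select(rat, gene, timepoint1) %>%
  pivot_wider(names_from = gene, values_from = timepoint1) %>%
  mutate(!!!newCols)  # splice the named expressions in as new columns

print(wide)
```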
Actually, getting both timepoints was a lot easier than I thought:
means %>%
select(-gene_category) %>%
gather("timepoint", "value", starts_with("timepoint")) %>%
spread(gene, value) %>%
mutate_(.dots = colsToCreate)
which gives:
rat timepoint gene1 gene2 gene3 gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500 -17.95333 -23.07000 -16.39833 -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500 -18.62000 -22.25667 -17.06500 -20.70167
Note also that you can name the vector holding the column formulas, e.g.:
colsToCreate2 <-
setNames(colsToCreate
, c("nameA", "nameB", "nameC", "nameD"))
means %>%
select(rat, gene, timepoint1) %>%
spread(gene, timepoint1) %>%
mutate_(.dots = colsToCreate2)
which gives:
rat gene1 gene2 gene3 gene4 nameA nameB nameC nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
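I don't know why, but this question got me excited, and I wanted to finish the thought. Here I gather the comparisons back into long form, then mutate the timepoint into a number using parse_number from readr, and separate the compared genes into their own columns for easy access and filtering. Note that reusing each gene violates the assumption of independence, so do not run statistics on these differences without thinking very carefully about the controls.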
library(readr)  # for parse_number
longForm <-
means %>%
select(-gene_category) %>%
gather("timepoint", "value", starts_with("timepoint")) %>%
spread(gene, value) %>%
mutate_(.dots = colsToCreate) %>%
select_(.dots = paste0("-",unlist(geneLists))) %>%
gather(Comparison, Difference, -rat, -timepoint) %>%
mutate(time = parse_number(timepoint)) %>%
separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")
head(longForm)
which gives:
rat timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1 gene1 gene3 -18.68833 1
2 Rat1 timepoint2 gene1 gene3 -17.95333 2
3 Rat2 timepoint1 gene1 gene3 -16.76167 1
4 Rat2 timepoint2 gene1 gene3 -18.62000 2
5 Rat1 timepoint1 gene2 gene3 -23.79500 1
6 Rat1 timepoint2 gene2 gene3 -23.07000 2
Then we can plot the results:
library(ggplot2)
longForm %>%
ggplot(aes(x = time
, y = Difference
, col = rat)) +
geom_line() +
facet_grid(exp_Gene ~ cont_Gene)
Answer 1 (score: 3)
Here is a solution using data.table (version 1.9.7+):
library(data.table)
setDT(means)
# join on rat being same and gene categories not being same, discard unmatched rows
# then extract interesting columns
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
.(rat, gene.exp = gene, gene.ctrl = i.gene,
timediff1 = timepoint1 - i.timepoint1, timediff2 = timepoint2 - i.timepoint2)]
# rat gene.exp gene.ctrl timediff1 timediff2
#1: Rat1 gene1 gene3 -18.68833 -17.95333
#2: Rat1 gene2 gene3 -23.79500 -23.07000
#3: Rat1 gene1 gene4 -16.72000 -16.39833
#4: Rat1 gene2 gene4 -21.82667 -21.51500
#5: Rat2 gene1 gene3 -16.76167 -18.62000
#6: Rat2 gene2 gene3 -22.32833 -22.25667
#7: Rat2 gene1 gene4 -14.79334 -17.06500
#8: Rat2 gene2 gene4 -20.36000 -20.70167
If you want to generalize to an arbitrary number of "timepoint" columns:
nm = grep("timepoint", names(means), value = TRUE)
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
c(.(rat = rat, gene.exp = gene, gene.ctrl = i.gene),
setDT(mget(nm)) - mget(paste0('i.', nm)))]
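Since the question also asked about base R, the join idea can be sketched there too. This is an illustration under my own naming (experimental, control, pairs, result, and the diff_ prefix are not from either answer): split the rows by gene_category, merge the two halves on rat so that every experimental gene meets every control gene within the same rat, and subtract the matching timepoint columns.

```r
means <- read.table(header = TRUE, text = "
rat gene gene_category timepoint1 timepoint2
Rat1 gene1 experimental 23.36667 23.49667
Rat1 gene2 experimental 18.26000 18.38000
Rat1 gene3 control 42.05500 41.45000
Rat1 gene4 control 40.08667 39.89500
Rat2 gene1 experimental 25.29333 22.83000
Rat2 gene2 experimental 19.72667 19.19333
Rat2 gene3 control 42.05500 41.45000
Rat2 gene4 control 40.08667 39.89500
")

# Split by category; the category column is no longer needed afterwards.
experimental <- subset(means, gene_category == "experimental", select = -gene_category)
control <- subset(means, gene_category == "control", select = -gene_category)

# Joining on rat alone pairs each experimental gene with each control gene
# within a rat, never across rats.
pairs <- merge(experimental, control, by = "rat", suffixes = c(".exp", ".ctrl"))

# Subtract control from experimental for every timepoint column at once.
timeCols <- grep("^timepoint", names(means), value = TRUE)
diffs <- pairs[paste0(timeCols, ".exp")] - pairs[paste0(timeCols, ".ctrl")]
names(diffs) <- paste0("diff_", timeCols)

result <- cbind(pairs[c("rat", "gene.exp", "gene.ctrl")], diffs)
print(result)
```

Because timeCols is found by pattern, additional timepoint columns would be picked up without further changes.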