Using R

Time: 2016-10-21 15:17:55

Tags: r dplyr

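I'm struggling to come up with a working solution to what seems like a fairly simple problem. I have a data frame containing data and factors, and I want to use the factors to decide which data points need to be subtracted from which others, producing a new data frame of comparison values.

Here is what the data frame looks like: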

str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of  5 variables:
 $ rat          : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ gene         : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
 $ timepoint1   : num  23.4 18.3 42.1 40.1 25.3 ...
 $ timepoint2   : num  23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]

 rat   gene gene_category timepoint1 timepoint2
(fctr) (fctr)        (fctr)      (dbl)      (dbl)
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333

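For each rat (8 rats in total), I want to subtract the 'control' gene values (genes 3 and 4) from the 'experimental' gene values (genes 1 and 2). I need to do this iteratively, so that each experimental gene value has each control gene value subtracted from it (within each rat, but not across rats). This should be done for each timepoint column.

I've been fiddling with a solution using dplyr. I've got the grouping done, but I can't figure out how to do the rest: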

diffs <- means %>% group_by(rat, gene, gene_category) # %>% ...this is where I don't know what to do

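There is a solution to a similar problem here, but I think it would give me every possible pairwise operation, which is not what I want. It also involves only two factors, whereas I need to account for three.

Here's another solution to a similar problem, but a few things make it less than ideal. It handles only one factor, and I'm not sure how it would apply to a data set with three factors and two data vectors.

I know this problem gets solved whenever pairwise comparisons are made to determine statistical significance (multiple t-tests, ANOVA, MANOVA, etc.), but the packages and base stats functions I'm familiar with for those tests keep this basic operation under the hood. I'd like a simple solution using base R or dplyr/plyr/reshape2 etc., with as few loops as possible.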

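2 Answers:

Answer 0: (score: 3)

I think the solution will involve generating the comparisons you want and then passing them to the standard-evaluation mutate_, rather than fighting with group_by and summarize.

First, here is the data read in (note that I added genes 3/4 for Rat2):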

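# packages used throughout this answer
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)

means <-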
  read.table(text =
" rat   gene gene_category timepoint1 timepoint2
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333
7   Rat2  gene3       control   42.05500   41.45000
8   Rat2  gene4       control   40.08667   39.89500")

Next, generate the set of genes in each category:

geneLists <-
  means %>%
  {split(.$gene, .$`gene_category`)} %>%
  lapply(unique) %>%
  lapply(as.character) %>%
  lapply(function(x){paste0("`", x, "`")})

Note that the backticks ("`") protect against potentially invalid column names (e.g., names containing spaces). This gives:

$control
[1] "`gene3`" "`gene4`"

$experimental
[1] "`gene1`" "`gene2`"

Then paste together the desired comparisons:

colsToCreate <-
  outer(geneLists[["experimental"]]
        , geneLists[["control"]]
        , paste, sep = " - ") %>%
  as.character()

which gives:

[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"
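To see why the comparisons come out in that order, here is a minimal sketch using plain gene names (no backticks) as stand-in inputs. outer() builds a matrix with one row per experimental gene and one column per control gene, and flattening it goes column by column, so the experimental gene varies fastest:

```r
# Stand-in inputs; the answer builds these from geneLists instead
experimental <- c("gene1", "gene2")
control <- c("gene3", "gene4")

# outer() applies paste() to every experimental/control pair (2 x 2 matrix)
comparisons <- outer(experimental, control, paste, sep = " - ")

# as.character() drops the dimensions, reading the matrix column by column
flat <- as.character(comparisons)
print(flat)
```

Swapping the two arguments to outer() would make the control gene vary fastest instead.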

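Then use tidyr to spread the data, producing one row per rat. Note that if you want to spread both timepoint1 and timepoint2, you will probably need to gather first (putting both times into one column), then create an id column combining time and gene, and then spread on that single id column. That would also require changing how colsToCreate is constructed.

After spreading, pass in the vector of columns to generate, and you should have what you want: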

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate)

Voilà:

   rat    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000

Actually, getting both timepoints turned out to be much easier than I expected:

means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate)

which gives:

   rat  timepoint    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500     -17.95333     -23.07000     -16.39833     -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500     -18.62000     -22.25667     -17.06500     -20.70167
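For readers on current package versions: mutate_() has been deprecated since dplyr 0.7.0, and gather()/spread() are superseded in tidyr by pivot_longer()/pivot_wider(). A rough sketch of the same two-timepoint computation with the newer verbs could look like the following. It assumes a recent dplyr and tidyr, retypes the example data by hand, and introduces newCols as an illustrative name that is not part of the original answer:

```r
library(dplyr)
library(tidyr)

# Example data retyped from the answer above
means <- data.frame(
  rat = rep(c("Rat1", "Rat2"), each = 4),
  gene = rep(paste0("gene", 1:4), times = 2),
  gene_category = rep(c("experimental", "experimental", "control", "control"), 2),
  timepoint1 = c(23.36667, 18.26, 42.055, 40.08667, 25.29333, 19.72667, 42.055, 40.08667),
  timepoint2 = c(23.49667, 18.38, 41.45, 39.895, 22.83, 19.19333, 41.45, 39.895)
)

colsToCreate <- c("`gene1` - `gene3`", "`gene2` - `gene3`",
                  "`gene1` - `gene4`", "`gene2` - `gene4`")

# Parse each string into an unevaluated call and name it (hypothetical helper list)
newCols <- lapply(colsToCreate, function(s) parse(text = s)[[1]])
names(newCols) <- gsub("`", "", colsToCreate, fixed = TRUE)

wide <- means %>%
  select(-gene_category) %>%
  pivot_longer(starts_with("timepoint"), names_to = "timepoint", values_to = "value") %>%
  pivot_wider(names_from = gene, values_from = value) %>%
  mutate(!!!newCols)  # splice the named calls in as new columns

print(wide)
```

Parsing strings is kept here only to stay close to the answer's approach; building the calls directly would avoid the string round trip.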

Also note that you can name the vector holding the column formulas, for example:

colsToCreate2 <-
  setNames(colsToCreate
           , c("nameA", "nameB", "nameC", "nameD"))

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate2)

which gives:

   rat    gene1    gene2  gene3    gene4     nameA     nameB     nameC     nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
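If typing the names by hand is impractical, they could also be derived from the formulas themselves. This is only a sketch; the "_minus_" naming scheme and the autoNames variable are assumptions made here for illustration:

```r
colsToCreate <- c("`gene1` - `gene3`", "`gene2` - `gene3`",
                  "`gene1` - `gene4`", "`gene2` - `gene4`")

# Strip the backticks, then turn " - " into a syntactic separator
autoNames <- gsub("`", "", colsToCreate, fixed = TRUE)
autoNames <- gsub(" - ", "_minus_", autoNames, fixed = TRUE)

# Same formulas as before, now carrying generated names
colsToCreate2 <- setNames(colsToCreate, autoNames)
print(names(colsToCreate2))
```

Names built this way are valid R identifiers, so the resulting columns can be referenced without backticks.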

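I don't know why, but this question got me excited, and I wanted to see the idea through. Here, I gather the comparisons back into long format, then mutate the timepoint into a number using parse_number from readr. Using separate to split the compared genes into their own columns allows efficient access and filtering. Note that reusing each gene breaks the assumption of independence, so don't run statistics on these differences without thinking very carefully about how to control for that.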

longForm <-
  means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate) %>%
  select_(.dots = paste0("-",unlist(geneLists))) %>%
  gather(Comparison, Difference, -rat, -timepoint) %>%
  mutate(time = parse_number(timepoint)) %>%
  separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")

head(longForm)

which gives:

   rat  timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1    gene1     gene3  -18.68833    1
2 Rat1 timepoint2    gene1     gene3  -17.95333    2
3 Rat2 timepoint1    gene1     gene3  -16.76167    1
4 Rat2 timepoint2    gene1     gene3  -18.62000    2
5 Rat1 timepoint1    gene2     gene3  -23.79500    1
6 Rat1 timepoint2    gene2     gene3  -23.07000    2
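The two reshaping helpers at the end of that pipeline can be tried in isolation. The toy data frame below is made up for illustration and is not the real longForm; the sketch assumes readr and tidyr are installed:

```r
library(readr)
library(tidyr)

# Toy stand-in for the gathered comparisons
toy <- data.frame(
  timepoint = c("timepoint1", "timepoint2"),
  Comparison = c("gene1 - gene3", "gene2 - gene4"),
  stringsAsFactors = FALSE
)

# parse_number() pulls the numeric part out of each label
toy$time <- parse_number(toy$timepoint)

# separate() splits one column into two at the " - " delimiter
toy <- separate(toy, Comparison, c("exp_Gene", "cont_Gene"), sep = " - ")
print(toy)
```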

Then we can plot the results:

longForm %>%
  ggplot(aes(x = time
             , y = Difference
             , col = rat)) +
  geom_line() +
  facet_grid(exp_Gene ~ cont_Gene)


Answer 1: (score: 3)

Here is a solution using a non-equi join in data.table (1.9.7+):

library(data.table)
setDT(means)

# join on rat being same and gene categories not being same, discard unmatched rows
# then extract interesting columns
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      .(rat, gene.exp = gene, gene.ctrl = i.gene,
        timediff1 = timepoint1 - i.timepoint1, timediff2 = timepoint2 - i.timepoint2)]
#    rat gene.exp gene.ctrl timediff1 timediff2
#1: Rat1    gene1     gene3 -18.68833 -17.95333
#2: Rat1    gene2     gene3 -23.79500 -23.07000
#3: Rat1    gene1     gene4 -16.72000 -16.39833
#4: Rat1    gene2     gene4 -21.82667 -21.51500
#5: Rat2    gene1     gene3 -16.76167 -18.62000
#6: Rat2    gene2     gene3 -22.32833 -22.25667
#7: Rat2    gene1     gene4 -14.79334 -17.06500
#8: Rat2    gene2     gene4 -20.36000 -20.70167

If you want to generalize to an arbitrary number of "timepoint" columns:

nm = grep("timepoint", names(means), value = TRUE)

means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      c(.(rat = rat, gene.exp = gene, gene.ctrl = i.gene),
        setDT(mget(nm)) - mget(paste0('i.', nm)))]
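
The same join idea can also be sketched in base R, which may help if data.table is not available. This is not the answer's method, just an approximation of it: split the rows by gene_category, then merge() the two halves on rat so that every experimental row meets every control row within the same rat. The data frame is retyped from the example, and exp_rows, ctrl_rows, pairs and diffs are names introduced here:

```r
# Example data retyped from the thread
means <- data.frame(
  rat = rep(c("Rat1", "Rat2"), each = 4),
  gene = rep(paste0("gene", 1:4), times = 2),
  gene_category = rep(c("experimental", "experimental", "control", "control"), 2),
  timepoint1 = c(23.36667, 18.26, 42.055, 40.08667, 25.29333, 19.72667, 42.055, 40.08667),
  timepoint2 = c(23.49667, 18.38, 41.45, 39.895, 22.83, 19.19333, 41.45, 39.895),
  stringsAsFactors = FALSE
)

exp_rows <- means[means$gene_category == "experimental", ]
ctrl_rows <- means[means$gene_category == "control", ]

# Joining on rat alone pairs each experimental gene with each control gene per rat
pairs <- merge(exp_rows, ctrl_rows, by = "rat", suffixes = c(".exp", ".ctrl"))

diffs <- data.frame(
  rat = pairs$rat,
  gene.exp = pairs$gene.exp,
  gene.ctrl = pairs$gene.ctrl,
  timediff1 = pairs$timepoint1.exp - pairs$timepoint1.ctrl,
  timediff2 = pairs$timepoint2.exp - pairs$timepoint2.ctrl
)
print(diffs)
```

The row order may differ from the data.table output, so sort on rat and the gene columns if the order matters.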