Compare two columns and check whether the values in another column increased or decreased

Time: 2016-10-11 11:48:12

Tags: r

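I have a data.frame with several columns. One column (sequence) holds unique sequences. I want to compare it with the next release of the same data.frame, look at how many peptides each sequence has, and check whether that number has increased or decreased.

I get this data.frame from a database, but the problem is that the database puts new sequences at random positions in every release (see the 2nd release).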

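If each release appended the new sequences to the end of the column, I would have no trouble using the duplicated function, but unfortunately they are inserted at random positions.

Here is an example: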

1st release:
    ID  | sequence | ... | Peptides | nºproject
    1 | atggggg  | ... | 65       | project 
    2 | tgatgat  | ... | 3        | project 
    3 | actgat   | ... | 32       | project 
    4 | atgtagtt | ... | 25       | project 
    5 | ttttaaat | ... | 32       | project 



2nd release:
    ID  | sequence | ... | Peptides | nºproject
    1 | atggggg  | ... | 66       | project 
    2 | tgatgat  | ... | 5        | project 
    3 | actgat   | ... | 36       | project 
    4 | ATTTGGGG | ... | 26       | project *** New one ***
    5 | ATTGATGA | ... | 32       | project *** New one ***
    6 | atgtagtt | ... | 47       | project 
    7 | ttttaaat | ... | 38       | project 
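
Because the row positions shift between releases, the comparison has to go through the sequence value rather than the row index. A minimal base R sketch of that idea, assuming the two tables above are typed in by hand as `release1` and `release2` (both names, and the `comparison` and `new_only` objects, are made up for illustration):

```r
# Toy copies of the two release tables above (only the columns needed here).
release1 <- data.frame(
  sequence = c("atggggg", "tgatgat", "actgat", "atgtagtt", "ttttaaat"),
  peptides = c(65, 3, 32, 25, 32),
  stringsAsFactors = FALSE
)
release2 <- data.frame(
  sequence = c("atggggg", "tgatgat", "actgat", "ATTTGGGG", "ATTGATGA",
               "atgtagtt", "ttttaaat"),
  peptides = c(66, 5, 36, 26, 32, 47, 38),
  stringsAsFactors = FALSE
)

# For each old sequence, find its row in the new release, wherever it landed.
idx <- match(release1$sequence, release2$sequence)

comparison <- data.frame(
  sequence = release1$sequence,
  old      = release1$peptides,
  new      = release2$peptides[idx],
  stringsAsFactors = FALSE
)
comparison$change <- comparison$new - comparison$old

# Sequences that appear only in the new release.
new_only <- setdiff(release2$sequence, release1$sequence)

print(comparison)
print(new_only)
```

A positive `change` means the peptide count went up, a negative one means it went down, and `new_only` lists the sequences that were inserted in the second release.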

2nd release:

df <- structure(list(
  ID = structure(c(1L, 2L, 3L, 4L, 5L),
                 .Label = c("1", "2", "3", "4", "5")),
  sequence = structure(c(1L, 2L, 3L, 4L, 5L),
                       .Label = c(" actgat   ", " atagattg ", " atatagag ", " atggggg  ", " atgtagtt "),
                       class = "factor"),
  peptides = structure(c(1L, 2L, 3L, 4L, 5L),
                       .Label = c(" 54  ", " 84  ", " 32  ", " 36  ", "12"),
                       class = "factor"),
  n_project = structure(c(1L, 1L, 1L, 1L, 1L),
                        .Label = " project ", class = "factor")),
  .Names = c("ID", "sequence", "peptides", "n_project"),
  class = "data.frame", row.names = c(NA, -5L))

3 answers:

Answer 0 (score: 4)

First convert your peptide counts to numeric (they are factors with numeric-looking character labels, which is a bit messy):

> df$peptides=as.numeric(as.character(df$peptides))
> df2$peptides=as.numeric(as.character(df2$peptides))
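
The detour through as.character() matters: calling as.numeric() directly on a factor returns the internal level codes, not the numbers in the labels. A small sketch with a stand-alone factor that mimics the peptides column (the variable names are only for illustration):

```r
# A factor whose labels look like numbers, as in the question's peptides column.
labels <- c(" 54  ", " 84  ", " 32  ", " 36  ", "12")
peptides <- factor(labels, levels = labels)

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(peptides)

# Going through character first parses the labels themselves;
# as.numeric() ignores the surrounding whitespace.
values <- as.numeric(as.character(peptides))

print(codes)
print(values)
```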

A left join will match the new data to the old data:

> require(dplyr)
> left_join(df, df2, c("sequence"="sequence"))
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    3  atatagag          32    project     5         31    project 
4    4  atggggg           36    project     6         36    project 
5    5  atgtagtt          12    project     7         15    project 
Warning message:
In left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
  joining factors with different levels, coercing to character vector

Ignore the warning. A left join plus a filter finds where the peptide count has increased:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y>peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    1  actgat            54    project     1         56    project 
2    2  atagattg          84    project     2         85    project 
3    5  atgtagtt          12    project     7         15    project 

Save that as a new data frame.

As a check, these are the ones that decreased or stayed the same:

> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y<=peptides.x)
  ID.x   sequence peptides.x n_project.x ID.y peptides.y n_project.y
1    3  atatagag          32    project     5         31    project 
2    4  atggggg           36    project     6         36    project 
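
If you would rather label every row in one pass than run two filters, something like the following might work. It is only a sketch: df2 is not given in the question, so both data frames here are rebuilt from the printed output above, and the `.old`/`.new` suffixes and the `direction` column are names I made up.

```r
library(dplyr)

# Rebuilt from the printed output above; df2 itself is not in the question.
df <- data.frame(
  ID = 1:5,
  sequence = c("actgat", "atagattg", "atatagag", "atggggg", "atgtagtt"),
  peptides = c(54, 84, 32, 36, 12),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = 1:7,
  sequence = c("actgat", "atagattg", "TATATCC", "TTTTAAAT",
               "atatagag", "atggggg", "atgtagtt"),
  peptides = c(56, 85, 76, 98, 31, 36, 15),
  stringsAsFactors = FALSE
)

compared <- left_join(df, df2, by = "sequence", suffix = c(".old", ".new")) %>%
  mutate(direction = case_when(
    is.na(peptides.new)         ~ "not in new release",  # no match in df2
    peptides.new > peptides.old ~ "increased",
    peptides.new < peptides.old ~ "decreased",
    TRUE                        ~ "unchanged"
  ))

# Same rows as the first filter() call above.
increased <- filter(compared, direction == "increased")

print(compared)
```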

Answer 1 (score: 4)

@Spacedman's solution, but with data.table:

library("data.table")
setDT(df, key = 'sequence')
setDT(df2, key = 'sequence')
df2[df]

Or as a one-liner (works with recent versions of data.table):

library("data.table")
setDT(df2)[df, on="sequence"]
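
The join can also compute the difference directly in `j`. A sketch, again with data rebuilt from the output printed in the other answers; `dt_old`, `dt_new` and the output column names are assumptions of mine. Inside `x[i, on = ...]` the `i.` prefix refers to columns of the table passed as `i`.

```r
library(data.table)

# Rebuilt from the output printed in the other answers (assumed values).
dt_old <- data.table(
  sequence = c("actgat", "atagattg", "atatagag", "atggggg", "atgtagtt"),
  peptides = c(54, 84, 32, 36, 12)
)
dt_new <- data.table(
  sequence = c("actgat", "atagattg", "TATATCC", "TTTTAAAT",
               "atatagag", "atggggg", "atgtagtt"),
  peptides = c(56, 85, 76, 98, 31, 36, 15)
)

# One result row per row of dt_old; peptides is the new count,
# i.peptides the old one.
changes <- dt_new[dt_old, on = "sequence",
                  .(sequence = i.sequence,
                    old = i.peptides,
                    new = peptides,
                    change = peptides - i.peptides)]

print(changes)
```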

Answer 2 (score: 3)

Since you have a common key, you can use a join.

In the tidyverse it looks like this:

library(tidyverse)

df %>% 
  full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
  # Fix data to convert to character and numeric
  mutate_each(funs(as.numeric(as.character(.))), starts_with("pept")) %>%
  # See difference
  mutate(change = peptides.2 - peptides.1)

#> Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> factors with different levels, coercing to character vector
#>   ID.1   sequence peptides.1 n_project.1 ID.2 peptides.2 n_project.2  change
#> 1    1  actgat            54    project     1         56    project       2
#> 2    2  atagattg          84    project     2         85    project       1
#> 3    3  atatagag          32    project     5         31    project      -1
#> 4    4  atggggg           36    project     6         36    project       0
#> 5    5  atgtagtt          12    project     7         15    project       3
#> 6   NA   TATATCC          NA        <NA>    3         76    project      NA
#> 7   NA  TTTTAAAT          NA        <NA>    4         98    project      NA
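
In newer dplyr releases mutate_each() and funs() are deprecated, so the pipeline above may warn or fail there. A rough equivalent with across(), assuming a dplyr version that provides it (1.0.0 or later); the data frames are rebuilt from the output above with peptides stored as factors, since df2 is not in the question:

```r
library(dplyr)

# Rebuilt from the printed output; peptides kept as factors on purpose.
df <- data.frame(
  ID = 1:5,
  sequence = c("actgat", "atagattg", "atatagag", "atggggg", "atgtagtt"),
  peptides = factor(c(" 54  ", " 84  ", " 32  ", " 36  ", "12")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = 1:7,
  sequence = c("actgat", "atagattg", "TATATCC", "TTTTAAAT",
               "atatagag", "atggggg", "atgtagtt"),
  peptides = factor(c("56", "85", "76", "98", "31", "36", "15")),
  stringsAsFactors = FALSE
)

result <- df %>%
  full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
  # Convert both peptide columns from factor to numeric via character
  mutate(across(starts_with("pept"), ~ as.numeric(as.character(.x)))) %>%
  # Difference between the two releases; NA for rows present in only one
  mutate(change = peptides.2 - peptides.1)

print(result)
```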

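With the full_join we can see:

  1. How well df and df2 match up.
  2. The rows that are new in df2 (their peptides.1 value is NA).
  3. How the peptide counts changed between releases.

Note that I am assuming your sequence data is case-sensitive here.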
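
If that assumption does not hold for your data, or if the padding around the sequences differs between releases, the keys would need normalising before the join. A hypothetical sketch (the example keys and the `normalise_key` helper are invented for illustration; only fold case if upper and lower case really mean the same sequence in your database):

```r
# Hypothetical keys: the same sequences, but with different padding and case.
old_keys <- c(" actgat   ", " atggggg  ")
new_keys <- c("ACTGAT", "atggggg")

# Compared as-is, nothing lines up.
raw_match <- match(old_keys, new_keys)

# Hypothetical helper: strip whitespace and fold case before matching.
normalise_key <- function(x) toupper(trimws(as.character(x)))

clean_match <- match(normalise_key(old_keys), normalise_key(new_keys))

print(raw_match)
print(clean_match)
```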

Base R

You can also do this in base R with merge, but I prefer the tidyverse syntax above.

    merge(df, df2, by = "sequence", all = TRUE)
    #>     sequence ID.x peptides.x n_project.x ID.y peptides.y n_project.y
    #> 1  actgat       1       54      project     1       56      project 
    #> 2  atagattg     2       84      project     2       85      project 
    #> 3  atatagag     3       32      project     5       31      project 
    #> 4  atggggg      4       36      project     6       36      project 
    #> 5  atgtagtt     5         12    project     7         15    project 
    #> 6   TATATCC    NA       <NA>        <NA>    3         76    project 
    #> 7  TTTTAAAT    NA       <NA>        <NA>    4         98    project
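
To get the change and the new rows out of the merged result, one option is to add two columns afterwards. A sketch with data rebuilt from the output above (df2 is not in the question, and the `change` and `status` column names are my own):

```r
# Rebuilt from the printed output; peptides already numeric here.
df <- data.frame(
  ID = 1:5,
  sequence = c("actgat", "atagattg", "atatagag", "atggggg", "atgtagtt"),
  peptides = c(54, 84, 32, 36, 12),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = 1:7,
  sequence = c("actgat", "atagattg", "TATATCC", "TTTTAAAT",
               "atatagag", "atggggg", "atgtagtt"),
  peptides = c(56, 85, 76, 98, 31, 36, 15),
  stringsAsFactors = FALSE
)

# all = TRUE keeps rows that exist in only one of the two releases.
merged <- merge(df, df2, by = "sequence", all = TRUE)

# Positive: increased, negative: decreased, NA: present in one release only.
merged$change <- merged$peptides.y - merged$peptides.x

merged$status <- ifelse(is.na(merged$peptides.x), "new in df2",
                 ifelse(is.na(merged$peptides.y), "missing from df2", "in both"))

print(merged)
```

Note that merge sorts the result by the key column, so the row order depends on your locale.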