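I have a data.frame with several columns. One column (sequence) holds unique sequences, and I want to compare it with the next release of the same data.frame, check how many peptides each sequence has, and see whether that number has increased or decreased.
I get this data.frame from a database, but the problem is that in each release the database inserts new sequences at random positions (see the 2º release).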
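If each release added the new sequences at the end of the column, I would have no problem using the duplicated function, but unfortunately they are placed at random.
Here is an example:
1º release: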
1ºRelease
ID | sequence | ... | Peptides | nºproject
1 | atggggg | ... | 65 | project
2 | tgatgat | ... | 3 | project
3 | actgat | ... | 32 | project
4 | atgtagtt | ... | 25 | project
5 | ttttaaat | ... | 32 | project
2ºrelease
ID | sequence | ... | Peptides | nºproject
1 | atggggg | ... | 66 | project
2 | tgatgat | ... | 5 | project
3 | actgat | ... | 36 | project
4 | ATTTGGGG | ... | 26 | project *** New one ***
5 | ATTGATGA | ... | 32 | project *** New one ***
6 | atgtagtt | ... | 47 | project
7 | ttttaaat | ... | 38 | project
2º release:
df <- structure(list(ID = structure(c(1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4" ,"5") ),
sequence = structure(c(1L,2L, 3L, 4L, 5L),
.Label = c(" actgat "," atagattg ", " atatagag ", " atggggg ", " atgtagtt "), class = "factor"),
peptides = structure(c(1L, 2L, 3L, 4L, 5L),
.Label = c(" 54 ", " 84 ", " 32 ", " 36 ", "12"),
class = "factor"), n_project = structure(c(1L, 1L, 1L, 1L, 1L),
.Label = " project ", class = "factor")), .Names = c("ID", "sequence", "peptides", "n_project"), class = "data.frame", row.names = c(NA, -5L))
Answer 0 (score: 4)
First convert your peptide counts to numeric (they are factors with numeric character labels, which is a bit messy):
> df$peptides=as.numeric(as.character(df$peptides))
> df2$peptides=as.numeric(as.character(df2$peptides))
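To see why the detour through as.character() matters, here is a minimal sketch with a made-up factor (the values are illustrative and only loosely modelled on the question's data). Calling as.numeric() directly on a factor returns the internal level codes rather than the numbers shown in the labels.

```r
# Hypothetical factor whose labels look like numbers (with stray spaces,
# similar to the dput output in the question).
pep <- factor(c(" 54 ", " 84 ", " 32 "))

# Direct conversion returns the integer level codes, not the counts.
codes <- as.numeric(pep)

# Converting to character first recovers the actual values;
# as.numeric() tolerates the surrounding whitespace.
values <- as.numeric(as.character(pep))

print(codes)
print(values)
```

If the two results differ, the column was a factor and the two-step conversion is the one to use.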
A left join will match the new data to the old data:
> require(dplyr)
> left_join(df, df2, c("sequence"="sequence"))
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 1 actgat 54 project 1 56 project
2 2 atagattg 84 project 2 85 project
3 3 atatagag 32 project 5 31 project
4 4 atggggg 36 project 6 36 project
5 5 atgtagtt 12 project 7 15 project
Warning message:
In left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y) :
joining factors with different levels, coercing to character vector
Ignore the warning. A left join plus a filter will find where the peptide count has increased:
> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y>peptides.x)
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 1 actgat 54 project 1 56 project
2 2 atagattg 84 project 2 85 project
3 5 atgtagtt 12 project 7 15 project
Save that as a new data frame.
As a check, here are the ones that decreased or stayed the same:
> filter(left_join(df, df2, c("sequence"="sequence")), peptides.y<=peptides.x)
ID.x sequence peptides.x n_project.x ID.y peptides.y n_project.y
1 3 atatagag 32 project 5 31 project
2 4 atggggg 36 project 6 36 project
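The two filters can also be folded into a single label per sequence. The sketch below swaps dplyr for base R's merge() and uses small made-up data frames (old, new, and their values are assumptions for illustration, not the question's data).

```r
# Toy stand-ins for the two releases (hypothetical values).
old <- data.frame(sequence = c("actgat", "atggggg", "atgtagtt"),
                  peptides = c(54, 36, 12),
                  stringsAsFactors = FALSE)
new <- data.frame(sequence = c("atgtagtt", "TATATCC", "actgat", "atggggg"),
                  peptides = c(15, 76, 56, 35),
                  stringsAsFactors = FALSE)

# Inner join on sequence; the row order in either release does not matter.
both <- merge(old, new, by = "sequence", suffixes = c(".old", ".new"))

# Label each shared sequence by the sign of the change.
delta <- both$peptides.new - both$peptides.old
both$direction <- ifelse(delta > 0, "increased",
                  ifelse(delta < 0, "decreased", "unchanged"))

# Count how many sequences fall into each category.
print(table(both$direction))
```

Because the match is made on the sequence value, it does not matter where the database inserted the new rows.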
Answer 1 (score: 4)
@Spacedman's solution, but with data.table:
library("data.table")
setDT(df, key = 'sequence')
setDT(df2, key = 'sequence')
df2[df]
Or as a one-liner (works with recent versions of data.table):
library("data.table")
setDT(df2)[df, on="sequence"]
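As a possible extension, the change in peptide count can be computed inside the same join. This is a sketch on made-up tables (old and new are hypothetical names and values), assuming a data.table version that supports the x. and i. column prefixes in j.

```r
library(data.table)

# Hypothetical miniature releases with character keys.
old <- data.table(sequence = c("actgat", "atggggg"), peptides = c(54, 36))
new <- data.table(sequence = c("atggggg", "TATATCC", "actgat"),
                  peptides = c(36, 76, 56))

# new[old] keeps every row of old; inside j, the x. prefix refers to
# columns of new and the i. prefix to columns of old.
res <- new[old, on = "sequence",
           .(sequence,
             peptides_old = i.peptides,
             peptides_new = x.peptides,
             change = x.peptides - i.peptides)]
print(res)
```

Sequences that exist only in the new release are dropped by this join, which matches the left-join behaviour of the first answer.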
Answer 2 (score: 3)
Since you have a common key, you can use a join.
In the tidyverse it looks like this:
library(tidyverse)
df %>%
full_join(df2, by = "sequence", suffix = c(".1", ".2")) %>%
# Fix data to convert to character and numeric
mutate_each(funs(as.numeric(as.character(.))), starts_with("pept")) %>%
# See difference
mutate(change = peptides.2 - peptides.1)
#> Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> factors with different levels, coercing to character vector
#> ID.1 sequence peptides.1 n_project.1 ID.2 peptides.2 n_project.2 change
#> 1 1 actgat 54 project 1 56 project 2
#> 2 2 atagattg 84 project 2 85 project 1
#> 3 3 atatagag 32 project 5 31 project -1
#> 4 4 atggggg 36 project 6 36 project 0
#> 5 5 atgtagtt 12 project 7 15 project 3
#> 6 NA TATATCC NA <NA> 3 76 project NA
#> 7 NA TTTTAAAT NA <NA> 4 98 project NA
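With full_join we see:
what matched between df and df2, and
the new rows in df2 (with peptide values of NA).
In this case I have assumed that your sequence data is case sensitive.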
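mutate_each() and funs() have since been deprecated in dplyr. A roughly equivalent pipeline using across() (available from dplyr 1.0.0) might look like the sketch below; the data frames old and new are small made-up stand-ins, not the question's data.

```r
library(dplyr)

# Hypothetical miniature releases; peptides stored as factors,
# as in the question.
old <- data.frame(sequence = c("actgat", "atggggg"),
                  peptides = factor(c("54", "36")),
                  stringsAsFactors = FALSE)
new <- data.frame(sequence = c("atggggg", "TATATCC", "actgat"),
                  peptides = factor(c("36", "76", "56")),
                  stringsAsFactors = FALSE)

res <- old %>%
  full_join(new, by = "sequence", suffix = c(".1", ".2")) %>%
  # across() applies the same conversion to every matching column
  mutate(across(starts_with("pept"), ~ as.numeric(as.character(.x)))) %>%
  # Difference between releases; NA for sequences present in only one
  mutate(change = peptides.2 - peptides.1)

print(res)
```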
You can also do this in base R with merge, but I prefer the tidyverse syntax above.
merge(df, df2, by = "sequence", all = TRUE)
#> sequence ID.x peptides.x n_project.x ID.y peptides.y n_project.y
#> 1 actgat 1 54 project 1 56 project
#> 2 atagattg 2 84 project 2 85 project
#> 3 atatagag 3 32 project 5 31 project
#> 4 atggggg 4 36 project 6 36 project
#> 5 atgtagtt 5 12 project 7 15 project
#> 6 TATATCC NA <NA> <NA> 3 76 project
#> 7 TTTTAAAT NA <NA> <NA> 4 98 project
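The case-sensitivity assumption decides which sequences count as new. A minimal base R sketch on made-up vectors (old_seq and new_seq are hypothetical) shows both readings:

```r
# Hypothetical sequence columns from two releases.
old_seq <- c("atggggg", "tgatgat", "ttttaaat")
new_seq <- c("atggggg", "TTTTAAAT", "tgatgat", "ATTTGGGG")

# Case-sensitive: "TTTTAAAT" counts as new because it differs from "ttttaaat".
new_cs <- setdiff(new_seq, old_seq)

# Case-insensitive: compare lower-cased copies, report the original strings.
new_ci <- new_seq[!tolower(new_seq) %in% tolower(old_seq)]

print(new_cs)
print(new_ci)
```

If upper- and lower-case sequences should be treated as the same entry, lower-casing the sequence column in both data frames before the join would be one way to get consistent matches.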