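I have a table in which every sample has a unique identifier and also a section identifier. I want to extract all distance comparisons within each section (the distances come from a second table).
For example, table 1: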
Sample Section
1 1
2 1
3 1
4 2
5 2
6 3
Table 2:
Sample1 Sample2 distance
1 2 10
1 3 1
1 4 2
2 3 5
2 4 10
3 4 11
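So the output I want is a list of the distances for [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5], i.e. every distance comparison in table 2 between samples that share a section in table 1.
I started trying to do this with nested for loops, but it quickly got messy... Any ideas on how to do this?
Answer 0 (score: 1)
A solution using dplyr.
We can first create a data frame showing the combinations of samples within each section.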
library(dplyr)
table1_cross <- full_join(table1, table1, by = "Section") %>% # Full join by Section
filter(Sample.x != Sample.y) %>% # Remove records with same samples
rowwise() %>%
mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
ungroup() %>%
distinct(Sample.all, .keep_all = TRUE) %>% # Remove duplicates in Sample.all
select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
# Sample1 Sample2 Section
# <int> <int> <int>
# 1 1 2 1
# 2 1 3 1
# 3 2 3 1
# 4 4 5 2
Then we can filter table2 by table1_cross. table3 is the final output.
table3 <- table2 %>%
semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_cross
table3
# Sample1 Sample2 distance
# 1 1 2 10
# 2 1 3 1
# 3 2 3 5
Data
table1 <- read.table(text = "Sample Section
1 1
2 1
3 1
4 2
5 2
6 3",
header = TRUE, stringsAsFactors = FALSE)
table2 <- read.table(text = "Sample1 Sample2 distance
1 2 10
1 3 1
1 4 2
2 3 5
2 4 10
3 4 11",
header = TRUE, stringsAsFactors = FALSE)
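Answer 1 (score: 0)
The OP has asked to find all distance comparisons in table2 for samples that share a section in table1.
This can be achieved by two different approaches: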
1. Look up the respective section IDs of Sample1 and Sample2 in table1 and keep only the rows of table2 where the section IDs match.
2. Create all unique combinations of sample IDs for each section in table1 and find the corresponding entries in table2 (if any).
Approach 1
Base R
tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]
  Sample2 Sample1 distance
1       2       1       10
2       3       1        1
3       3       2        5
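The first approach (look up each sample's section, then keep the rows where both sections agree) can also be sketched without any merge, using match() as a lookup. This is only an illustrative variant, not code from the answer; the names sec1, sec2 and same_section are made up for the example, and the data is the same as in the question.

```r
# Same data as in the question
table1 <- read.table(text = "Sample Section
1 1
2 1
3 1
4 2
5 2
6 3", header = TRUE)
table2 <- read.table(text = "Sample1 Sample2 distance
1 2 10
1 3 1
1 4 2
2 3 5
2 4 10
3 4 11", header = TRUE)

# Look up the section of each sample by its position in table1
sec1 <- table1$Section[match(table2$Sample1, table1$Sample)]
sec2 <- table1$Section[match(table2$Sample2, table1$Sample)]

# Keep rows whose two samples share a section;
# which() drops comparisons that are NA (samples missing from table1)
same_section <- table2[which(sec1 == sec2), ]
same_section
```

Unlike the merge() version, this keeps the original row order and column order of table2.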
dplyr
library(dplyr)
table2 %>%
inner_join(table1, by = c(Sample1 = "Sample")) %>%
inner_join(table1, by = c(Sample2 = "Sample")) %>%
filter(Section.x == Section.y) %>%
select(-Section.x, -Section.y)
  Sample1 Sample2 distance
1       1       2       10
2       1       3        1
3       2       3        5
data.table
Using nested joins
library(data.table)
tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
table1[tmp, on = .(Sample == Sample2)][
Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]
Using merge() and chained data.table expressions
tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
Section.x == Section.y, -c("Section.x", "Section.y")]
   Sample2 Sample1 distance
1:       2       1       10
2:       3       1        1
3:       3       2        5
Approach 2
Base R
# Build all sample pairs within each section, then look them up in table2
table1_cross <- do.call(rbind, lapply(
  split(table1, table1$Section),
  function(x) as.data.frame(combinat::combn2(x$Sample))))
merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))
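Here, the handy combn2(x) function is used, which generates all combinations of the elements of x taken two at a time, e.g.,
combinat::combn2(1:3)
     [,1] [,2]
[1,]    1    2
[2,]    1    3
[3,]    2    3
The tedious part is to apply combn2() to each Section group separately and to create a data frame that can be merged.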
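As a hedged alternative to combn2(), the same per-section pairs can be sketched with combn() from base R's utils package, which avoids the combinat dependency. The names table1_pairs and result are illustrative and not from the answer. Sections with a single sample are handled explicitly, because they have no pairs and because combn() treats a single positive number n as the sequence 1:n.

```r
# Same data as in the question
table1 <- read.table(text = "Sample Section
1 1
2 1
3 1
4 2
5 2
6 3", header = TRUE)
table2 <- read.table(text = "Sample1 Sample2 distance
1 2 10
1 3 1
1 4 2
2 3 5
2 4 10
3 4 11", header = TRUE)

# For each section, build a data frame of all unordered sample pairs
pairs_by_section <- lapply(split(table1$Sample, table1$Section), function(s) {
  if (length(s) < 2) {
    # A section with one sample has no pairs: return an empty data frame
    return(data.frame(Sample1 = s[0], Sample2 = s[0]))
  }
  m <- t(combn(s, 2))  # one row per pair
  data.frame(Sample1 = m[, 1], Sample2 = m[, 2])
})
table1_pairs <- do.call(rbind, pairs_by_section)

# Keep only the pairs that actually have a distance in table2
result <- merge(table2, table1_pairs, by = c("Sample1", "Sample2"))
result
```

The pair [4 vs 5] is generated but does not appear in the result, because table2 holds no distance for it.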
dplyr
full_join(table1, table1, by = "Section") %>%
filter(Sample.x < Sample.y) %>%
semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))
data.table
library(data.table)
setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow.cartesian = TRUE,
.(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
on = .(Sample1, Sample2), nomatch = 0L]
   Sample1 Sample2 distance Section
1:       1       2       10       1
2:       1       3        1       1
3:       2       3        5       1
Here, a non-equi join is used to create the unique combinations of Sample within each Section. This is equivalent to using combn2():
setDT(table1)[table1, on = .(Section, Sample < Sample), allow.cartesian = TRUE,
.(Section, Sample1 = x.Sample, Sample2 = i.Sample)]
   Section Sample1 Sample2
1:       1      NA       1
2:       1       1       2
3:       1       1       3
4:       1       2       3
5:       2      NA       4
6:       2       4       5
7:       3      NA       6
The NA rows will be removed in the final join.
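The idea behind the non-equi join (match on Section, and keep only rows where the first sample ID is smaller than the second) can be sketched in base R as a self-merge followed by a filter. This is an illustration of the condition rather than a replacement for the data.table code; the names self_join and section_pairs are made up for the example.

```r
# Same table1 as in the question
table1 <- read.table(text = "Sample Section
1 1
2 1
3 1
4 2
5 2
6 3", header = TRUE)

# Self-merge on Section: every sample is paired with every sample
# of the same section, including itself and both orderings
self_join <- merge(table1, table1, by = "Section")

# The inequality keeps each unordered pair exactly once
section_pairs <- self_join[self_join$Sample.x < self_join$Sample.y,
                           c("Section", "Sample.x", "Sample.y")]
section_pairs
```

Because this is an inner self-merge followed by a filter, no NA rows are produced; a section with a single sample simply contributes nothing.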