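Filter items in one column based on shared items in another column in R

Date: 2017-12-29 11:37:06

Tags: r dataframe

I have a table in which each sample has a unique identifier and also a section identifier. I want to extract all distance comparisons within each section (the distances come from a second table).

For example, table 1: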

Sample    Section
1         1
2         1
3         1
4         2
5         2
6         3

Table 2:

sample    sample    distance
1         2         10
1         3         1
1         4         2
2         3         5
2         4         10
3         4         11

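So my desired output is a list of distances: [1 vs 2], [1 vs 3], [2 vs 3], [4 vs 5], i.e. all distance comparisons in table 2 between samples that share a section in table 1.

I started trying to do this with nested for loops, but it quickly got messy... Any ideas for a clean way to do this?

2 Answers:

Answer 0: (score: 1)

A solution using dplyr.

We can first create a data frame showing the combinations of samples within each section.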

library(dplyr)

table1_cross <- full_join(table1, table1, by = "Section") %>%    # Full join by Section
  filter(Sample.x != Sample.y) %>%                               # Remove records with same samples
  rowwise() %>%
  mutate(Sample.all = toString(sort(c(Sample.x, Sample.y)))) %>% # Create a column showing the combination between Sample.x and Sample.y
  ungroup() %>%
  distinct(Sample.all, .keep_all = TRUE) %>%                     # Remove duplicates in Sample.all
  select(Sample1 = Sample.x, Sample2 = Sample.y, Section)
table1_cross
# # A tibble: 4 x 3
#   Sample1 Sample2 Section
#     <int>   <int>   <int>
# 1       1       2       1
# 2       1       3       1
# 3       2       3       1
# 4       4       5       2

Then we can filter table2 by table1_cross. table3 is the final output.

table3 <- table2 %>%
  semi_join(table1_cross, by = c("Sample1", "Sample2")) # Filter table2 based on table1_cross

table3
#   Sample1 Sample2 distance
# 1       1       2       10
# 2       1       3        1
# 3       2       3        5
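
The semi_join above matches on the exact (Sample1, Sample2) values, and in the example data both tables store every pair with the smaller id first. If table2 could hold a pair in the opposite order (an assumption; the question does not say so), the pair would be missed. A minimal base R sketch of normalising the order before joining, where `table2_rev` and `normalise_pairs` are hypothetical names introduced only for illustration:

```r
# Self-contained copy of table2 from the question.
table2 <- data.frame(Sample1 = c(1, 1, 1, 2, 2, 3),
                     Sample2 = c(2, 3, 4, 3, 4, 4),
                     distance = c(10, 1, 2, 5, 10, 11))

# Hypothetical variant in which the pair (1, 2) is stored as (2, 1).
table2_rev <- table2
table2_rev[1, c("Sample1", "Sample2")] <- c(2, 1)

# Put the smaller id first so every pair has one canonical orientation.
normalise_pairs <- function(df) {
  lo <- pmin(df$Sample1, df$Sample2)
  hi <- pmax(df$Sample1, df$Sample2)
  df$Sample1 <- lo
  df$Sample2 <- hi
  df
}

table2_norm <- normalise_pairs(table2_rev)
table2_norm
```

After this step, `table2_norm` can be used in place of table2 in the semi_join.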

Data

table1 <- read.table(text = "Sample    Section
1         1
                     2         1
                     3         1
                     4         2
                     5         2
                     6         3",
                     header = TRUE, stringsAsFactors = FALSE)

table2 <- read.table(text = "Sample1    Sample2    distance
1         2         10
                     1         3         1
                     1         4         2
                     2         3         5
                     2         4         10
                     3         4         11",
                     header = TRUE, stringsAsFactors = FALSE)

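Answer 1: (score: 0)

The OP has asked to find all distance comparisons in table2 for samples that share a section in table1.

This can be achieved by two different approaches:

  1. Look up the section ids of Sample1 and Sample2 in table1 and keep only those rows of table2 where the section ids match.
  2. Create all unique combinations of sample ids for each section in table1 and find the corresponding entries in table2 (if any).

    Approach 1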

    Base R

    tmp <- merge(table2, table1, by.x = "Sample1", by.y = "Sample")
    tmp <- merge(tmp, table1, by.x = "Sample2", by.y = "Sample")
    tmp[tmp$Section.x == tmp$Section.y, c("Sample2", "Sample1", "distance")]
    
      Sample2 Sample1 distance
    1       2       1       10
    2       3       1        1
    3       3       2        5
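
The same lookup can also be written without any merge, using match() to fetch the section of each sample directly. This is only a sketch under the assumption that sample ids are unique in table1; the names `section1`, `section2` and `same_section` are introduced here for illustration.

```r
# Self-contained copies of the example data from the question.
table1 <- data.frame(Sample = c(1, 2, 3, 4, 5, 6),
                     Section = c(1, 1, 1, 2, 2, 3))
table2 <- data.frame(Sample1 = c(1, 1, 1, 2, 2, 3),
                     Sample2 = c(2, 3, 4, 3, 4, 4),
                     distance = c(10, 1, 2, 5, 10, 11))

# Look up the section of each endpoint; match() gives NA for unknown samples.
section1 <- table1$Section[match(table2$Sample1, table1$Sample)]
section2 <- table1$Section[match(table2$Sample2, table1$Sample)]

# Keep rows whose endpoints share a section; which() drops NA comparisons.
same_section <- table2[which(section1 == section2), ]
same_section
```

Unlike the merge() version, this keeps the original column and row order of table2.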
    

    dplyr

    library(dplyr)
    table2 %>% 
      inner_join(table1, by = c(Sample1 = "Sample")) %>% 
      inner_join(table1, by = c(Sample2 = "Sample")) %>% 
      filter(Section.x == Section.y) %>% 
      select(-Section.x, -Section.y)
    
      Sample1 Sample2 distance
    1       1       2       10
    2       1       3        1
    3       2       3        5
    

    data.table

    Using nested joins

    library(data.table)
    tmp <- setDT(table1)[setDT(table2), on = .(Sample == Sample1)]
    table1[tmp, on = .(Sample == Sample2)][
      Section == i.Section, .(Sample1 = i.Sample, Sample2 = Sample, distance)]
    

    Using merge() and chained data.table expressions

    tmp <- merge(setDT(table2), setDT(table1), by.x = "Sample1", by.y = "Sample")
    merge(tmp, table1, by.x = "Sample2", by.y = "Sample")[
      Section.x == Section.y, -c("Section.x", "Section.y")]
    
       Sample2 Sample1 distance
    1:       2       1       10
    2:       3       1        1
    3:       3       2        5
    

    Approach 2

    Base R

    table1_cross <- do.call(rbind, lst <- lapply(
      split(table1, table1$Section), 
      function(x) as.data.frame(combinat::combn2(x$Sample))))
    merge(table2, table1_cross, by.x = c("Sample1", "Sample2"), by.y = c("V1", "V2"))
    

    Here, the handy combn2(x) function is used, which generates all combinations of the elements of x taken two at a time, e.g.,

    combinat::combn2(1:3)
    
         [,1] [,2]
    [1,]    1    2
    [2,]    1    3
    [3,]    2    3
    

    The tedious part is applying combn2() to each Section group separately and building a data frame that can be merged.
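
For readers who would rather not depend on the combinat package, the per-section pairing might also be sketched with utils::combn() from base R. This is an alternative, not the code used in the answer; `pairs_in_section` and `table1_pairs` are hypothetical names. The guard matters because combn(x, 2) with a single positive number x is interpreted as combinations of seq_len(x), which would invent pairs for a section holding only one sample.

```r
# Self-contained copies of the example data from the question.
table1 <- data.frame(Sample = c(1, 2, 3, 4, 5, 6),
                     Section = c(1, 1, 1, 2, 2, 3))
table2 <- data.frame(Sample1 = c(1, 1, 1, 2, 2, 3),
                     Sample2 = c(2, 3, 4, 3, 4, 4),
                     distance = c(10, 1, 2, 5, 10, 11))

# All unordered pairs of sample ids within one section.
pairs_in_section <- function(samples) {
  # A section with fewer than two samples has no pairs; return an empty frame.
  if (length(samples) < 2) {
    return(data.frame(Sample1 = numeric(0), Sample2 = numeric(0)))
  }
  pairs <- t(utils::combn(samples, 2))  # one row per pair
  data.frame(Sample1 = pairs[, 1], Sample2 = pairs[, 2])
}

# Build the pairs per section, stack them, then look them up in table2.
table1_pairs <- do.call(rbind, lapply(split(table1$Sample, table1$Section),
                                      pairs_in_section))
result <- merge(table2, table1_pairs, by = c("Sample1", "Sample2"))
result
```

The pair (4, 5) is generated for section 2 but has no entry in table2, so merge() leaves it out of the result.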

    dplyr

    This is a streamlined version of www's approach:

    full_join(table1, table1, by = "Section") %>%
      filter(Sample.x < Sample.y) %>% 
      semi_join(x = table2, y = ., by = c(Sample1 = "Sample.x", Sample2 = "Sample.y"))
    

    Non-equi self join

    library(data.table)
    setDT(table2)[setDT(table1)[table1, on = .(Section, Sample < Sample), allow.cartesian = TRUE,
                  .(Section, Sample1 = x.Sample, Sample2 = i.Sample)],
                  on = .(Sample1, Sample2), nomatch = 0L]
    
       Sample1 Sample2 distance Section
    1:       1       2       10       1
    2:       1       3        1       1
    3:       2       3        5       1
    

    Here, a non-equi join is used to create the unique combinations of Sample within each Section. This is equivalent to using combn2():

    setDT(table1)[table1, on = .(Section, Sample < Sample), allow.cartesian = TRUE,
                  .(Section, Sample1 = x.Sample, Sample2 = i.Sample)]
    
       Section Sample1 Sample2
    1:       1      NA       1
    2:       1       1       2
    3:       1       1       3
    4:       1       2       3
    5:       2      NA       4
    6:       2       4       5
    7:       3      NA       6
    

    The NA rows will be dropped in the final join.