Combine data frames, keeping only the IDs common to both

Time: 2016-12-07 16:21:32

Tags: r merge dplyr

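I have two datasets of lobster egg sizes, collected by two different samplers, that I am using to assess measurement variability. Each sampler measured ~50 eggs per lobster from numerous lobsters. Occasionally, however, a lobster was processed by sampler 1 but not by sampler 2, or vice versa. I want to combine the data from both samplers into a new dataset, but drop all data for lobsters that were processed by only one sampler. I have tried semi_join and intersect from dplyr, but I need the matching to go between datasets 1 -> 2 and 2 <- 1. I was able to create a new dataset that binds the rows from both samplers, but it is not clear to me how to remove from it every lobster ID that appears in only one of the two datasets.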

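Below is a simplified version of my data. It has multiple egg-area measurements from multiple lobsters, but the sampling does not always overlap (i.e., some lobsters' eggs were measured by only one sampler and not the other):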

install.packages("dplyr")  # the package name must be a quoted string
library(dplyr)

sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2",
                                   "Lobster2","Lobster2","Lobster2",
                                   "Lobster2","Lobster3","Lobster3","Lobster3"),
                       Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8),
                       Sampler=c(rep("Sampler1", 10)))
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1",
                                   "Lobster1","Lobster1","Lobster2",
                                   "Lobster2","Lobster2","Lobster4","Lobster4"),
                       Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4),
                       Sampler=c(rep("Sampler2", 10)))

combined <- bind_rows(sampler1, sampler2)

desiredresult <- combined[-c(8, 9, 10, 19, 20), ]

The last line of the script is the desired result for the mock data. I would prefer to stick to base R or dplyr.

4 answers:

Answer 0: (score: 2)

combined <- bind_rows(sampler1, sampler2)


Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)])

combined <- combined[combined$LobsterID %in% Lobsters.2.sample,]

Answer 1: (score: 2)

Using base R:

combined <-rbind(sampler1, sampler2)
inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]])
output <- combined[combined[["LobsterID"]] %in% inBoth, ]

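intersect finds the set intersection of the two ID vectors, which gives the lobsters present in both samples. All of these functions are vectorized, so this should run very quickly.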
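To make the behaviour of intersect and %in% concrete, here is a minimal base R sketch on two short ID vectors. The vectors ids1 and ids2 are made up for illustration and stand in for the LobsterID columns of the two samplers.

```r
# Hypothetical ID vectors (with repeats), standing in for the LobsterID columns
ids1 <- c("Lobster1", "Lobster1", "Lobster2", "Lobster3")
ids2 <- c("Lobster1", "Lobster2", "Lobster2", "Lobster4")

# intersect() returns each shared value once, regardless of repeats
in_both <- intersect(ids1, ids2)
in_both
# "Lobster1" "Lobster2"

# Swapping the arguments gives the same set of IDs
setequal(intersect(ids1, ids2), intersect(ids2, ids1))
# TRUE

# %in% then yields one TRUE/FALSE per element, usable as a row filter
keep <- ids1 %in% in_both
keep
# TRUE TRUE TRUE FALSE
```

Because the intersection is symmetric, a single filter on the combined data covers both the 1 -> 2 and the 2 -> 1 direction.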

Answer 2: (score: 1)

Bind the rows, group by lobster, and filter on the number of distinct samplers in each group:

sampler1 %>% bind_rows(sampler2) %>% 
    group_by(LobsterID) %>% 
    filter(n_distinct(Sampler) == 2)

## Source: local data frame [15 x 3]
## Groups: LobsterID [2]
## 
##    LobsterID  Area  Sampler
##        <chr> <dbl>    <chr>
## 1   Lobster1  0.40 Sampler1
## 2   Lobster1  0.35 Sampler1
## 3   Lobster2  1.10 Sampler1
## 4   Lobster2  1.04 Sampler1
## 5   Lobster2  1.14 Sampler1
## 6   Lobster2  1.10 Sampler1
## 7   Lobster2  1.05 Sampler1
## 8   Lobster1  0.41 Sampler2
## 9   Lobster1  0.44 Sampler2
## 10  Lobster1  0.47 Sampler2
## 11  Lobster1  0.43 Sampler2
## 12  Lobster1  0.38 Sampler2
## 13  Lobster2  1.14 Sampler2
## 14  Lobster2  1.11 Sampler2
## 15  Lobster2  1.09 Sampler2
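For readers who want to stay in base R, the same "count distinct samplers per lobster" idea can be sketched with tapply. This is only an illustration on a cut-down toy dataset, not the code from the answer. The names s1, s2, both, n_samplers and kept are made up for the example.

```r
# Toy data: one measurement per lobster per sampler (hypothetical values)
s1 <- data.frame(LobsterID = c("Lobster1", "Lobster2", "Lobster3"),
                 Area = c(0.40, 1.10, 1.70),
                 Sampler = "Sampler1")
s2 <- data.frame(LobsterID = c("Lobster1", "Lobster2", "Lobster4"),
                 Area = c(0.41, 1.14, 1.41),
                 Sampler = "Sampler2")
both <- rbind(s1, s2)

# Number of distinct samplers seen for each lobster
n_samplers <- tapply(both$Sampler, both$LobsterID,
                     function(x) length(unique(x)))

# Keep only the lobsters measured by both samplers
ids_to_keep <- names(n_samplers)[n_samplers == 2]
kept <- both[both$LobsterID %in% ids_to_keep, ]
kept
```

Here Lobster3 and Lobster4 each have a count of 1 and are dropped, leaving four rows for Lobster1 and Lobster2.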

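Answer 3: (score: 1)

Here is an option using data.table. Bind the datasets with rbindlist, group by "LobsterID", and subset the rows with a logical condition on the number of unique elements in "Sampler", i.e. keep a group only when that number equals 2.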

library(data.table)
rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]