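I have two datasets of lobster egg sizes, collected by two different samplers, that will be used to assess measurement variability. Each sampler measures ~50 eggs per lobster from numerous lobsters. Occasionally, however, a lobster was processed by sampler 1 but not by sampler 2, or vice versa. I would like to combine the data from both samplers into a new dataset, but drop all data from lobsters that were processed by only one sampler. I have tried semi_join and intersect from dplyr, but I need the matching to work in both directions (dataset 1 -> 2 and 2 -> 1). I am able to create a new dataset that binds the rows from both samplers, but it is unclear how to then remove every lobster ID that appears in only one of the two datasets.
Below is a simplified version of my data, with multiple egg area measurements from multiple lobsters, where the sampling does not always overlap (i.e., some lobsters' eggs were measured by one sampler but not the other):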
install.packages("dplyr")
library(dplyr)
sampler1 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster2",
"Lobster2","Lobster3","Lobster3","Lobster3"),
Area=c(.4,.35,1.1,1.04,1.14,1.1,1.05,1.7,1.63,1.8),
Sampler=c(rep("Sampler1", 10)))
sampler2 <- data.frame(LobsterID=c("Lobster1","Lobster1","Lobster1",
"Lobster1","Lobster1","Lobster2",
"Lobster2","Lobster2","Lobster4","Lobster4"),
Area=c(.41,.44,.47,.43,.38,1.14,1.11,1.09,1.41,1.4),
Sampler=c(rep("Sampler2", 10)))
combined <- bind_rows(sampler1, sampler2)
desiredresult <- combined[-c(8, 9, 10, 19, 20), ]
The last line of the script is the desired result for the mock data. I would prefer a solution limited to base R or dplyr.
Answer 0 (score: 2)
combined <- bind_rows(sampler1, sampler2)
Lobsters.2.sample <- as.character(unique(sampler1$LobsterID)[unique(sampler1$LobsterID) %in% unique(sampler2$LobsterID)])
combined <- combined[combined$LobsterID %in% Lobsters.2.sample,]
Answer 1 (score: 2)
Using base R:
combined <-rbind(sampler1, sampler2)
inBoth <- intersect(sampler1[["LobsterID"]], sampler2[["LobsterID"]])
output <- combined[combined[["LobsterID"]] %in% inBoth, ]
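intersect finds the set intersection of the two ID vectors, which gives the lobsters present in both samples. All of the functions used are vectorized, so this should run quite fast.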
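As a quick sanity check, the base R approach can be run against the mock data from the question. The sketch below rebuilds that data with rep() (same values, just shorter to type) and compares the result with the asker's desiredresult; it assumes nothing beyond base R.

```r
# Mock data from the question, rebuilt with rep() so the snippet is self-contained
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area      = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler   = "Sampler1"
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area      = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler   = "Sampler2"
)

combined <- rbind(sampler1, sampler2)

# IDs that occur in both data frames
inBoth <- intersect(sampler1$LobsterID, sampler2$LobsterID)
print(inBoth)  # "Lobster1" "Lobster2"

# Keep only rows whose lobster was measured by both samplers
output <- combined[combined$LobsterID %in% inBoth, ]

# Compare against the hand-built target from the question
desiredresult <- combined[-c(8, 9, 10, 19, 20), ]
print(all.equal(output, desiredresult))
```

Because the filter works on IDs rather than row positions, it keeps working when the real data has a different number of eggs per lobster.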
Answer 2 (score: 1)
Bind the rows, group by LobsterID, and filter on the number of distinct samplers in each group:
sampler1 %>% bind_rows(sampler2) %>%
group_by(LobsterID) %>%
filter(n_distinct(Sampler) == 2)
## Source: local data frame [15 x 3]
## Groups: LobsterID [2]
##
## LobsterID Area Sampler
## <chr> <dbl> <chr>
## 1 Lobster1 0.40 Sampler1
## 2 Lobster1 0.35 Sampler1
## 3 Lobster2 1.10 Sampler1
## 4 Lobster2 1.04 Sampler1
## 5 Lobster2 1.14 Sampler1
## 6 Lobster2 1.10 Sampler1
## 7 Lobster2 1.05 Sampler1
## 8 Lobster1 0.41 Sampler2
## 9 Lobster1 0.44 Sampler2
## 10 Lobster1 0.47 Sampler2
## 11 Lobster1 0.43 Sampler2
## 12 Lobster1 0.38 Sampler2
## 13 Lobster2 1.14 Sampler2
## 14 Lobster2 1.11 Sampler2
## 15 Lobster2 1.09 Sampler2
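The condition n_distinct(Sampler) == 2 hard-codes the number of samplers. If more samplers might be added later, one possible variation (an assumption about future needs, not part of the original answer) is to count the samplers once up front and compare each group against that total. The name n_samplers is introduced here purely for illustration.

```r
library(dplyr)

# Mock data from the question, rebuilt with rep() so the snippet is self-contained
sampler1 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster3"), times = c(2, 5, 3)),
  Area      = c(.4, .35, 1.1, 1.04, 1.14, 1.1, 1.05, 1.7, 1.63, 1.8),
  Sampler   = "Sampler1"
)
sampler2 <- data.frame(
  LobsterID = rep(c("Lobster1", "Lobster2", "Lobster4"), times = c(5, 3, 2)),
  Area      = c(.41, .44, .47, .43, .38, 1.14, 1.11, 1.09, 1.41, 1.4),
  Sampler   = "Sampler2"
)

combined <- bind_rows(sampler1, sampler2)

# Total number of samplers in the combined data (hypothetical helper variable)
n_samplers <- n_distinct(combined$Sampler)

both <- combined %>%
  group_by(LobsterID) %>%
  # keep a lobster only if every sampler measured it
  filter(n_distinct(Sampler) == n_samplers) %>%
  # drop the grouping so later steps operate on the whole table
  ungroup()

print(both)
```

The trailing ungroup() is optional; it simply avoids carrying the LobsterID grouping into whatever analysis comes next.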
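Answer 3 (score: 1)
Here is an option using data.table. Bind the datasets with rbindlist, group by "LobsterID", and subset the rows with a logical condition on the number of unique elements in "Sampler", i.e. keep a group only when that number equals 2.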
library(data.table)
rbindlist(list(sampler1, sampler2))[, if(uniqueN(Sampler)==2) .SD , by = LobsterID]