The iris dataset has 50 entries for each of its three species:
data('iris')
table(iris$Species)
setosa versicolor virginica
50 50 50
Subset the iris dataset into two data frames (with overlapping species and asymmetric columns) and merge them with an outer join:
# missing Petal.Width
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Species'))
# missing Sepal.Length
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
SV_VV_merge <- merge(SV,VV,all=TRUE)
I find 16 extra entries for virginica:
table(SV_VV_merge$Species)
setosa versicolor virginica
50 50 66
How can I see which rows of the merged data frame have duplicated values in the shared columns 'Sepal.Width', 'Petal.Length' and 'Species' for the species 'virginica'?
Answer 0: (score: 1)
We can use duplicated
duplicated(SV_VV_merge)
SV_VV_merge[duplicated(SV_VV_merge), ]
The result can be confirmed by counting the unique rows
nrow(unique(SV_VV_merge))
[1] 154
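Since the question asks about repeats in the shared columns only, a variant that restricts duplicated to those columns may be closer to what is wanted. This is a minimal sketch, not part of the original answer; key_cols, in_dup_group and virginica_dups are names made up for illustration.

```r
data('iris')
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
             select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Species'))
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
             select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
SV_VV_merge <- merge(SV, VV, all = TRUE)

# hypothetical helper names, introduced only for this sketch
key_cols <- c('Sepal.Width', 'Petal.Length', 'Species')
keys <- SV_VV_merge[key_cols]

# duplicated() alone flags only the later repeats; combining it with
# fromLast = TRUE flags every row that belongs to a repeated key group
in_dup_group <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# keep the virginica rows whose shared-column values occur more than once
virginica_dups <- SV_VV_merge[in_dup_group & SV_VV_merge$Species == 'virginica', ]
virginica_dups
```

Dropping the Species filter would also show any repeated keys among the setosa and versicolor rows.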
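Note that the two data frames have different sets of columns. By default, merge() joins on the columns they have in common, so you may not get the result you expect.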
intersect(names(VV), names(SV))
[1] "Sepal.Width" "Petal.Length" "Species"
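One way to see where the extra virginica rows come from: both data frames contain the same virginica flowers, so a combination of the three shared columns that occurs n times in iris is matched n times on each side and yields n * n merged rows. The sketch below checks that arithmetic under this assumption; key, n_per_key, expected_rows and merged_rows are illustrative names, not from the original post.

```r
data('iris')
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
             select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Species'))
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
             select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
SV_VV_merge <- merge(SV, VV, all = TRUE)

# the columns merge() joins on by default
key_cols <- intersect(names(SV), names(VV))

# how often each key combination occurs among the virginica rows of iris
virginica <- subset(iris, Species == 'virginica')
key <- do.call(paste, c(virginica[key_cols], sep = '|'))
n_per_key <- as.vector(table(key))

# each key occurring n times contributes n * n rows to the merge
expected_rows <- sum(n_per_key^2)
merged_rows <- sum(SV_VV_merge$Species == 'virginica')

expected_rows                    # rows predicted for virginica
merged_rows                      # rows actually present after the merge
expected_rows - sum(n_per_key)   # the surplus over the original 50
```

If only one row per flower is wanted, merging on a unique row identifier instead of on measurement columns would avoid the many-to-many matches.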
Answer 1: (score: 1)
Perhaps not the most direct way, but you can add an indicator column to each data frame and then merge.
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length','Species'))
SV$sv_src <- "SV"
# missing Sepal.Length
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
VV$vv_src <- "VV"
SV_VV_merge <- merge(SV,VV,all=TRUE)
SV_VV_merge$row_src <- apply(SV_VV_merge[c("sv_src", "vv_src")], 1,
function(x) paste(na.omit(x), collapse = ""))
SV_VV_merge[, c("Sepal.Width", "Species", 'sv_src', 'vv_src', 'row_src')]
#  Sepal.Width    Species sv_src vv_src row_src
#1         2.0 versicolor   <NA>     VV      VV
#2         2.2 versicolor   <NA>     VV      VV
#3         2.2 versicolor   <NA>     VV      VV
#4         2.2  virginica     SV     VV    SVVV
#5         2.3     setosa     SV   <NA>      SV
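To summarise the indicator columns rather than scan the printed rows, the sources can be cross-tabulated against Species. This is a rough sketch that rebuilds the same data frames. It swaps the apply() call for a vectorised paste0(), which is my substitution rather than the answer's code, and src_by_species is a made-up name.

```r
data('iris')
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
             select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Species'))
SV$sv_src <- 'SV'
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
             select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
VV$vv_src <- 'VV'
SV_VV_merge <- merge(SV, VV, all = TRUE)

# vectorised alternative to apply(): an NA indicator becomes an empty string
SV_VV_merge$row_src <- paste0(
  ifelse(is.na(SV_VV_merge$sv_src), '', 'SV'),
  ifelse(is.na(SV_VV_merge$vv_src), '', 'VV')
)

# rows per species and source; 'SVVV' marks rows matched from both frames
src_by_species <- table(SV_VV_merge$Species, SV_VV_merge$row_src)
src_by_species
```

Rows labelled SVVV are the ones that found a partner in both data frames, which is where the many-to-many matches for virginica show up.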