Question

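Hello, I have two huge tables (> 10 million rows), each containing two ID columns in random order. Each row can be viewed as a pair. How can I get the unique overlap between the two tables? I know that in Python you can put objects in a set; is there similar functionality in R? Many thanks!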

Table 1

Table 2

Desired output

ID1 ID2
10  15 (count once only!)
26  71

Answer 1

Sort the elements of each pair so that merge works:

pairs1 <- unique(t(apply(DF1, 1, sort)))
pairs2 <- unique(t(apply(DF2, 1, sort)))

merge(pairs1, pairs2)
#   V1 V2
# 1 10 15
# 2 26 71

With dplyr you can do the same using a function whose name is more intuitive when comparing "sets".
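The dplyr snippet this answer refers to is not preserved here. Below is a minimal sketch of what it may have looked like, assuming the function meant is dplyr's `intersect()` method for data frames. `DF1` and `DF2` are made-up sample data (the original tables are not shown), and `normalise_pairs` is a hypothetical helper name.

```r
library(dplyr)

# Hypothetical sample data; the original tables are not shown in the question.
DF1 <- data.frame(ID1 = c(10, 15, 26, 3), ID2 = c(15, 10, 71, 4))
DF2 <- data.frame(ID1 = c(15, 71, 8), ID2 = c(10, 26, 9))

# Put the smaller ID first in every row so that (10, 15) and (15, 10)
# become identical. pmin()/pmax() are vectorised, which avoids a
# row-wise apply() on very large tables.
normalise_pairs <- function(df) {
  data.frame(ID1 = pmin(df$ID1, df$ID2), ID2 = pmax(df$ID1, df$ID2))
}

# intersect() on data frames keeps the distinct rows present in both inputs.
overlap <- dplyr::intersect(normalise_pairs(DF1), normalise_pairs(DF2))
overlap
```

Because the row-wise set operation already drops duplicates, no separate `unique()` call should be needed in this variant.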

Answer 2

You can use the functions semi_join or anti_join from the dplyr package. But there are other strategies as well:

table1<-data.frame(id=c(1:5), animal=c("cat", "dog", "parakeet", 
"lion", "duck"))
table1
##   id   animal
## 1  1      cat
## 2  2      dog
## 3  3 parakeet
## 4  4     lion
## 5  5     duck

table2<-table1[c(1,3,5),]
table2
##   id   animal
## 1  1      cat
## 3  3 parakeet
## 5  5     duck

# strategy 1
table1[!table1$id%in%table2$id,]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 2
table1[is.na(match(table1$id,table2$id)),]
##   id animal
## 2  2    dog
## 4  4   lion

# strategy 3. anti join
library(dplyr)
anti_join(table1, table2, by="id")
##   id animal
## 1  2    dog
## 2  4   lion
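The strategies above all return the rows of table1 that are missing from table2. Since the question asks for the overlap, the complementary call may be closer to what is wanted. A small sketch using the other function this answer mentions, `semi_join`, on the same example tables (the variable name `common` is only illustrative):

```r
library(dplyr)

# Same example tables as above, repeated so the snippet runs on its own.
table1 <- data.frame(id = c(1:5),
                     animal = c("cat", "dog", "parakeet", "lion", "duck"))
table2 <- table1[c(1, 3, 5), ]

# semi_join keeps the rows of table1 that have a match in table2,
# without adding any columns from table2.
common <- semi_join(table1, table2, by = "id")
common
```

In base R, negating strategy 1's condition, `table1[table1$id %in% table2$id, ]`, should select the same rows.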

Set comparison in R

2 answers: