Question

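I have two data frames. I want to use the elements of one data frame to search a column in the other, narrowing that data frame down by matching, and then keep narrowing element by element. The example below probably explains it better.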

df1    col1   

1      apples      
2      oranges     
3      apples    
4      banana  
5      grapes
6      mangoes
7      oranges
8      banana

df1 has only one column, whereas df2 has two columns: setID and col1.

df2 setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
18  3   apples      
19  3   banana  
20  3   oranges     
21  3   apples    
22  3   grapes
23  3   mangoes
24  3   oranges
25  3   banana
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

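As you can see, there are repeated setID values; each one marks a set. The order within a set matters. Note that the length of df1$col1 does not have to equal the length of a set in df2. They do not have to match exactly either; they only need to be close enough. In this case, df1$col1 is closest to df2$setID = 2, with only the last two elements out of order. They do not have to match exactly because I want a "search as you type" approach. I do not want to match df1$col1 against a setID in df2 directly; I want to narrow down the possible sets element by element. Suppose you receive the elements of df1 one at a time rather than as a complete data frame. For example:

Find matches for df1$col1[1] in df2 and save every set that contains a match to tempdf. It does not matter if df1$col1[1] is matched more than once within the same set. If it is found at least once, that set is added to tempdf.

What ultimately needs to be retrieved is a single setID, corresponding to the set that matches df1. In this case, tempdf would be the same as df2, because all sets include "apples". Next, df1$col1[2] is matched against tempdf, since the first element was a match. I would want df1$col1[1:2] from tempdf. This results in: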

tempdf  setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

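This essentially leaves out setID = 3. As this continues with the third element of df1, the new tempdf will contain only setID 2 and 4. The loop (which is the part I would like help with) ends when only one setID remains, in this case setID = 2. So setID = 2 is considered the close match for df1.

Of course, feel free to suggest a better approach than this.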
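
The narrowing procedure described above could be sketched in base R roughly as follows. This is only one reading of the question: it assumes that a set "matches" the first i elements of df1$col1 when it contains them as a contiguous run, which reproduces the tempdf steps shown above. The helpers contains_run() and narrow_sets() are hypothetical names made up for this illustration.

```r
# Query sequence (df1$col1 from the question)
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana",
                           "grapes", "mangoes", "oranges", "banana"),
                  stringsAsFactors = FALSE)

# Candidate sets (df2 from the question): four sets of eight items each
df2 <- data.frame(
  setID = rep(1:4, each = 8),
  col1 = c("apples", "oranges", "oranges", "mangoes",
           "grapes", "banana", "banana", "apples",
           "apples", "oranges", "apples", "banana",
           "grapes", "mangoes", "banana", "oranges",
           "apples", "banana", "oranges", "apples",
           "grapes", "mangoes", "oranges", "banana",
           "apples", "oranges", "apples", "grapes",
           "grapes", "oranges", "banana", "banana"),
  stringsAsFactors = FALSE)

# Hypothetical helper: does `items` contain `run` as a contiguous block?
contains_run <- function(items, run) {
  n <- length(run)
  if (n > length(items)) return(FALSE)
  starts <- seq_len(length(items) - n + 1)
  any(vapply(starts,
             function(s) identical(items[s:(s + n - 1)], run),
             logical(1)))
}

# Hypothetical helper: feed the query one element at a time and drop
# every set that no longer contains the elements seen so far.
narrow_sets <- function(query, sets) {
  candidates <- unique(sets$setID)
  for (i in seq_along(query)) {
    keep <- vapply(candidates, function(id) {
      contains_run(sets$col1[sets$setID == id], query[seq_len(i)])
    }, logical(1))
    if (!any(keep)) break              # nothing matches: keep previous candidates
    candidates <- candidates[keep]
    if (length(candidates) == 1) break # a single set remains: stop searching
  }
  candidates
}

result <- narrow_sets(df1$col1, df2)
result
```

If no set survives a step, the sketch keeps the candidates from the previous step instead of returning nothing; whether that is the desired behaviour is a design choice the question leaves open.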

Answer 1

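You may want to look at the "compare" package, which lets you compare objects while allowing various transformations.

Here are a couple of examples to consider...

Start with some sample data. Note setID == 4, which contains all the values, but in the wrong order.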

df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df1
##      col1
## 1  apples
## 2 oranges
## 3  apples
## 4  banana

df2 <- structure(list(setID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 
    4, 4, 4, 4), col1 = c("apples", "oranges", "apples", "banana", 
    "apples", "grapes", "oranges", "apples", "oranges", "grapes", 
    "banana", "banana", "apples", "apples", "banana", "oranges")), 
    .Names = c("setID", "col1"), 
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
    "9", "10", "11", "12", "13", "21", "31", "41"), class = "data.frame")
df2
##    setID    col1
## 1      1  apples
## 2      1 oranges
## 3      1  apples
## 4      1  banana
## 5      2  apples
## 6      2  grapes
## 7      2 oranges
## 8      2  apples
## 9      3 oranges
## 10     3  grapes
## 11     3  banana
## 12     3  banana
## 13     4  apples
## 21     4  apples
## 31     4  banana
## 41     4 oranges

Load "compare" and do some comparisons:

library(compare)
lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
## 
## $`3`
## FALSE [FALSE]
## 
## $`4`
## FALSE [FALSE]
##

Allow all transformations before comparing (see ?compare for details if you only want to allow certain transformations):

lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x, allowAll = TRUE))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`3`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`4`
## TRUE
##   sorted
##   renamed rows
##   dropped row names
##
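
For readers who do not have the compare package installed, the order-insensitive part of the check above (the "sorted" transformation that lets setID 4 match) could be approximated in base R as below. This is a rough sketch, not an equivalent of compare(): it only ignores row order, and the data frame is rebuilt here with integer setID values purely to keep the example self-contained.

```r
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "grapes", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "apples", "banana", "oranges"),
  stringsAsFactors = FALSE)

# TRUE when a set holds exactly the same items as df1, in any order
same_items <- vapply(split(df2$col1, df2$setID),
                     function(x) identical(sort(x), sort(df1$col1)),
                     logical(1))
same_items
```

With this data, sets 1 and 4 come out as TRUE, in line with the allowAll = TRUE result.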

Answer 2

Using base R:

split(df2,df2[,1])[by(df2[2],df2[1],function(x)all(x==df1))]
 $`1`
   setID    col1
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana
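
The one-liner above compares data frames with ==, which expects both sides to have the same size. A slightly more defensive variant, sketched here under the assumption that only an exact, ordered match on col1 counts, compares the plain character vectors with identical(), which simply returns FALSE when the lengths differ. The extra setID 5 with two rows is made up for this illustration.

```r
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(
  setID = c(rep(1:4, each = 4), 5, 5),   # set 5 is hypothetical and shorter
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "grapes", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "apples", "banana", "oranges",
           "apples", "oranges"),
  stringsAsFactors = FALSE)

# One logical per set: TRUE only for an exact, same-order, same-length match
exact <- vapply(split(df2$col1, df2$setID),
                function(x) identical(x, df1$col1),
                logical(1))

# Rows of the matching set(s)
matched <- df2[df2$setID %in% names(exact)[exact], ]
matched
```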

Answer 3

The OP has asked to find the setID groups in df2 whose col1 is exactly the same as col1 in df1.

For the sake of completeness, here is also a data.table approach:

library(data.table)
tmp <- setDT(df2)[, all(col1 == df1$col1), by = setID][(V1)]
tmp

   setID   V1
1:     1 TRUE

Now, the OP has asked for the matching rows to be returned. This can be done by looking up the matching values of setID

df2[setID %in% tmp$setID]

   setID    col1
1:     1  apples
2:     1 oranges
3:     1  apples
4:     1  banana

or by joining (which may be faster on large tables)

df2[tmp, on = "setID", .SD]

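which returns the same result.

Caveat

The sample data set provided by the OP suggests that the number of rows in df1 is the same as the number of rows in each setID group of df2. The OP has not specified the expected result if the numbers of rows differ.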
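
Since the expected result for differing row counts is not specified, the following is only one possible convention, shown as a sketch: compare the positions that both sequences have in common and rank the sets by the number of position-wise matches. The function match_score() and the three-row df1 are hypothetical and exist only for this illustration.

```r
# A query that is shorter than the sets it is compared against
df1 <- data.frame(col1 = c("apples", "oranges", "apples"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "grapes", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "apples", "banana", "oranges"),
  stringsAsFactors = FALSE)

# Hypothetical helper: count equal items over the positions both vectors share
match_score <- function(items, query) {
  n <- min(length(items), length(query))
  sum(items[seq_len(n)] == query[seq_len(n)])
}

scores <- vapply(split(df2$col1, df2$setID), match_score, integer(1),
                 query = df1$col1)
scores

# setID with the highest score (the first one wins in case of ties)
best <- names(scores)[which.max(scores)]
best
```

A score like this tolerates a few out-of-order elements, which is closer to the "close enough" matching the question describes than an all-or-nothing comparison.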

Matching observations in two data frames in R
