Question

I have two data frames, A and B.

a1 <- c(12, 12, 12, 23, 23, 23, 34, 34, 34)
a2 <- c(1, 2, 3 , 2, 4 , 5 , 2 , 3 , 4)
A <- as.data.frame(cbind(a1, a2))

b1 <- c(12, 23, 34)
b2 <- c(1, 2, 2)
B <- as.data.frame(cbind(b1, b2))

> A
  a1 a2
1 12  1
2 12  2
3 12  3
4 23  2
5 23  4
6 23  5
7 34  2
8 34  3
9 34  4
> B
  b1 b2
1 12  1
2 23  2
3 34  2

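Basically, B contains the rows of A that have the lowest value of a2 for each unique value of a1.

What I need to do is simple: find the row indices (or row numbers?), let's call it index.vector, such that A[index.vector, ] equals B.

For this particular problem there is only one solution, because for each unique value of a1 there are no repeated values in a2.

Any help is appreciated, and the faster the routine the better. I need to apply this to data frames with anywhere from 500 to millions of rows.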
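To make the relationship between A and B concrete, here is a minimal base R sketch that recovers the row numbers directly from the "lowest a2 per a1" description. It assumes, as stated above, that a2 has no ties within a group of a1; the use of `ave` and the name `index.vector` are illustrative choices, not part of the original post.

```r
# Rebuild the example data from the question
A <- data.frame(a1 = c(12, 12, 12, 23, 23, 23, 34, 34, 34),
                a2 = c(1, 2, 3, 2, 4, 5, 2, 3, 4))
B <- data.frame(b1 = c(12, 23, 34),
                b2 = c(1, 2, 2))

# ave() returns, for every row, the minimum a2 within that row's a1 group
group.min <- ave(A$a2, A$a1, FUN = min)

# Rows where a2 equals its group minimum (assumes no ties within a group)
index.vector <- which(A$a2 == group.min)
index.vector
# [1] 1 4 7

# Check that the selected rows carry the same values as B
identical(A$a1[index.vector], B$b1) && identical(A$a2[index.vector], B$b2)
```

With ties in a2 this would return more than one row per group, so it is only a sketch for the situation described in the question.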

Answer 1

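I would first make sure the data are ordered (in your example the data are already sorted, but I imagine that may not always be the case), and then use match, which returns the index of the first match of its first argument in its second (or NA if there is no match).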

A <- A[ order( A$a1 , A$a2 ) , ]
A
#  a1 a2
#1 12  1
#2 12  2
#3 12  3
#4 23  2
#5 23  4
#6 23  5
#7 34  2
#8 34  3
#9 34  4

#  Get row indices for required values
match( B$b1 , A$a1 )
# [1] 1 4 7
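If reordering A is undesirable (the indices above refer to the sorted A), one possible variation is to match on both columns at once by pasting them into a single key. This is a sketch of an alternative, not the approach used above; it assumes the pasted strings identify each row unambiguously, which holds for the example data.

```r
# Example data from the question
A <- data.frame(a1 = c(12, 12, 12, 23, 23, 23, 34, 34, 34),
                a2 = c(1, 2, 3, 2, 4, 5, 2, 3, 4))
B <- data.frame(b1 = c(12, 23, 34),
                b2 = c(1, 2, 2))

# Build one composite key per row, e.g. "12 1", and match B's keys in A's keys
key.A <- paste(A$a1, A$a2)
key.B <- paste(B$b1, B$b2)
index.vector <- match(key.B, key.A)
index.vector
# [1] 1 4 7
```

Because both columns take part in the match, the result does not depend on the row order of A, and a row of B that is absent from A yields NA.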

Here is a data.table solution, which should be much faster for large tables.

library(data.table)
A <- data.table( A )
B <- data.table( B )

#  Use setkeyv to order the tables by the values in the first column, then the second column
setkeyv( A , c("a1","a2") )
setkeyv( B , c("b1","b2") )

#  Create a new column that is the row index of A
A[ , ID:=(1:nrow(A)) ]

#  Join A and B on the key columns (this works because you have unique values in your second column for each grouping of the first), giving the relevant ID
A[B]
#   a1 a2 ID
#1: 12  1  1
#2: 23  2  4
#3: 34  2  7
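As a possible shortcut, data.table's join can return the matching row numbers of A directly through the `which = TRUE` argument, without adding an ID column. The sketch below assumes a data.table version that supports `on=` joins and uses unkeyed tables, so the row numbers refer to A in its current order.

```r
library(data.table)

# Example data from the question, built as data.tables
A <- data.table(a1 = c(12, 12, 12, 23, 23, 23, 34, 34, 34),
                a2 = c(1, 2, 3, 2, 4, 5, 2, 3, 4))
B <- data.table(b1 = c(12, 23, 34),
                b2 = c(1, 2, 2))

# Join B to A on both columns; which = TRUE returns row numbers of A
# instead of the joined rows (NA where a row of B has no match)
index.vector <- A[B, on = c(a1 = "b1", a2 = "b2"), which = TRUE]
index.vector
# [1] 1 4 7
```

The `on=` form spells out the column pairing explicitly, which avoids relying on both tables having been keyed in the same column order.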

Row indices in data frame A of the rows contained in data frame B
