Question

I have the following data frame (DF_A):

   PARTY_ID PROBS_3001 PROBS_3002 PROBS_3003 PROBS_3004 PROBS_3005 PROBS_3006 PROBS_3007 PROBS_3008
1:  1000000       0.03       0.58       0.01       0.42       0.69       0.98       0.55       0.96
2:  1000001       0.80       0.37       0.10       0.95       0.77       0.69       0.23       0.07
3:  1000002       0.25       0.73       0.79       0.83       0.24       0.82       0.81       0.01
4:  1000003       0.10       0.96       0.53       0.59       0.96       0.10       0.98       0.76
5:  1000004       0.36       0.87       0.76       0.03       0.95       0.40       0.53       0.89
6:  1000005       0.15       0.78       0.24       0.21       0.03       0.87       0.67       0.64

I have another data frame (DF_B):

    V1   V2   V3   V4 PARTY_ID
1 0.58 0.69 0.96 0.98  1000000
2 0.69 0.77 0.80 0.95  1000001
3 0.79 0.81 0.82 0.83  1000002
4 0.76 0.96 0.96 0.98  1000003
5 0.76 0.87 0.89 0.95  1000004
6 0.64 0.67 0.78 0.87  1000005

For each PARTY_ID, I need to find the column positions in DF_A at which the DF_B values occur, like this:

  PARTY_ID P1 P2 P3 P4
1  1000000  3  6  9  7
...

Currently I am using the match function, but it takes a long time (I have 400K rows). This is what I do:

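i <- 1
while (i <= nrow(DF_A)) {  # <= so that the last row is processed too
  # positions of the row-i values of DF_B within row i of DF_A
  position <- match(DF_B[i, ], DF_A[i, ])
  i <- i + 1
}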

It works, but it is slow, and I know this is not the best solution to my problem. Can anyone help me?

Answer 1

You can join the two tables and then use Map in a by-row grouped operation:

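library(data.table)

# df_a is assumed to already be a data.table; setDT() converts df_b by reference
# join df_b onto df_a by PARTY_ID so each row holds both sets of columns
df_a2 <- df_a[setDT(df_b), on = "PARTY_ID"]

# for each joined row, find the positions in the df_a columns that equal V1..V4
df_a3 <- df_a2[, c(PARTY_ID,
                Map(f = function(x, y) which(x == y),
                  x = list(.SD[, names(df_a), with = FALSE]),
                  y = .SD[, paste0("V", 1:4), with = FALSE])), by = 1:nrow(df_a2)]

# rename the result columns and drop the grouping column
setnames(df_a3, paste0("V", 1:5), c("PARTY_ID", paste0("P", 1:4)))[, nrow := NULL]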
df_a3
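#   PARTY_ID P1 P2 P3 P4
#1:  1000000  3  6  9  7
#2:  1000001  7  6  2  5
#3:  1000002  4  8  7  5
#4:  1000003  9  3  3  8
#5:  1000003  9  6  6  8
#6:  1000004  4  3  9  6
#7:  1000005  9  8  3  7

Note that PARTY_ID 1000003 appears twice: the value 0.96 occurs at two positions in its DF_A row (3 and 6), so which() returns both.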
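
As a rough alternative sketch (not the answerer's method), the same positions can be computed in base R by aligning the rows on PARTY_ID and calling match() once per row on plain matrices, which avoids the per-row data frame indexing of the original loop. The toy data below are the first three rows from the question; the names a, b, pos and result are made up for this illustration. Unlike which(x == y), match() reports only the first position when a value is repeated within a row.

```r
# Toy versions of DF_A and DF_B: the first three rows of the question's data.
DF_A <- data.frame(
  PARTY_ID   = c(1000000, 1000001, 1000002),
  PROBS_3001 = c(0.03, 0.80, 0.25),
  PROBS_3002 = c(0.58, 0.37, 0.73),
  PROBS_3003 = c(0.01, 0.10, 0.79),
  PROBS_3004 = c(0.42, 0.95, 0.83),
  PROBS_3005 = c(0.69, 0.77, 0.24),
  PROBS_3006 = c(0.98, 0.69, 0.82),
  PROBS_3007 = c(0.55, 0.23, 0.81),
  PROBS_3008 = c(0.96, 0.07, 0.01)
)
DF_B <- data.frame(
  V1 = c(0.58, 0.69, 0.79),
  V2 = c(0.69, 0.77, 0.81),
  V3 = c(0.96, 0.80, 0.82),
  V4 = c(0.98, 0.95, 0.83),
  PARTY_ID = c(1000000, 1000001, 1000002)
)

# Reorder DF_A so its rows follow DF_B's PARTY_ID order
# (a no-op for this toy data, but needed if the two orders differ).
a <- as.matrix(DF_A[match(DF_B$PARTY_ID, DF_A$PARTY_ID), ])
b <- as.matrix(DF_B[, paste0("V", 1:4)])

# For each row i, match() returns the first column position in a[i, ]
# of every value in b[i, ]. vapply() yields a 4 x n matrix, so transpose it.
pos <- t(vapply(seq_len(nrow(b)), function(i) match(b[i, ], a[i, ]), integer(4)))

result <- data.frame(PARTY_ID = DF_B$PARTY_ID, pos)
names(result)[-1] <- paste0("P", 1:4)
result
#   PARTY_ID P1 P2 P3 P4
# 1  1000000  3  6  9  7
# 2  1000001  7  6  2  5
# 3  1000002  4  8  7  5
```

The comparison relies on exact equality of doubles, which holds here because both tables carry the same two-decimal values; if the values come from separate computations, rounding both sides first may be safer.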

Answer 2

Here is an example with 1 million rows and two columns. It takes 14 milliseconds on my computer.

library(data.table)

# create data tables with matching ids but in different positions

x <- as.data.table(data.frame(id=sample(c(1:1000000), 1000000, replace=FALSE), y=sample(LETTERS, 1000000, replace=TRUE)))
y <- as.data.table(data.frame(id=sample(c(1:1000000), 1000000, replace=FALSE), z=sample(LETTERS, 1000000, replace=TRUE)))

# add a column to both data tables that stores the original row position in x and y

x$x_row_nr <- 1:nrow(x)
y$y_row_nr <- 1:nrow(y)

# set the key in both data tables using the matching column name
setkey(x, "id")
setkey(y, "id")

# merge the data tables into one
z <- merge(x,y)

# now use this to look up the position in the y data table
# of the record that sits at row 100 of the x data table

z[x_row_nr==100, y_row_nr]

z will contain the matched rows from both data sets, with the row-number columns attached.
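
To make the idea easier to follow, here is a tiny sketch of the same row-number trick on three rows. It uses base R's merge() instead of data.table so it runs without extra packages, and the ids and letters are invented purely for illustration.

```r
# Two small tables holding the same ids in different row orders (made-up values).
x <- data.frame(id = c(3, 1, 2), y = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 1), z = c("p", "q", "r"))

# Record each row's original position before merging.
x$x_row_nr <- seq_len(nrow(x))
y$y_row_nr <- seq_len(nrow(y))

# Merge on id; every row of z carries both original positions.
z <- merge(x, y, by = "id")

# Where in y is the record that sits at row 1 of x?
z$y_row_nr[z$x_row_nr == 1]
# [1] 2   (id 3 is row 1 in x and row 2 in y)
```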

Finding the positions of one data frame's elements within another data frame using R
