Question

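I have two data frames, df_1 and df_2. They share three columns: permno, cusip and ticker. Each row of df_1 is a unique stock. The permno, cusip and ticker values in df_1 are used to look up stock returns in df_2. Sometimes one or two of these variables are unavailable, but every row has at least one of them, and I will use that value to find the return in df_2.

Could you suggest a (fast) way to merge df_1 and df_2 so that a row is kept if at least one of the three columns permno, cusip or ticker matches?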

df_1

id  permno  cusip  ticker
1   1       11     AA
2   NA      12     NA
3   2       13     NA
4   5       NA     NA

df_2

permno  cusip  ticker  return  date
1       11     NA      100     date_1
7       15     BX      102     date_2
2       NA     CU      103     date_3
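
For anyone who wants to reproduce this, the two tables above can be built in R roughly as follows. The column types are an assumption (numeric permno, cusip and return; character ticker and date), since the post only shows the printed values.

```r
# Sample data copied from the tables above. Column types are assumed.
# stringsAsFactors = FALSE keeps ticker and date as character on R < 4.0.
df_1 <- data.frame(
  id     = 1:4,
  permno = c(1, NA, 2, 5),
  cusip  = c(11, 12, 13, NA),
  ticker = c("AA", NA, NA, NA),
  stringsAsFactors = FALSE
)

df_2 <- data.frame(
  permno = c(1, 7, 2),
  cusip  = c(11, 15, NA),
  ticker = c(NA, "BX", "CU"),
  return = c(100, 102, 103),
  date   = c("date_1", "date_2", "date_3"),
  stringsAsFactors = FALSE
)
```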

Desired result

id  permno  cusip  ticker  return  date
1   1       11     AA      100     date_1
1   1       11     NA      100     date_1
3   2       13     NA      103     date_3
3   2       NA     CU      103     date_3

Answer 1

This should work.

# define the columns common to both data frames
colmatch <- c("permno", "cusip", "ticker")

# trim data frame df1 down to the rows that have at least one match
# in a common column with data frame df2, and append the columns of
# df2 that are not among the common columns
simplify <- function(df1, df2, col = colmatch) {
  # for each common column, find the position in df2 of each df1 value;
  # incomparables = NA stops an NA from matching another NA
  idx <- sapply(col, function(x)
    match(df1[[x]], df2[[x]], incomparables = NA)
  )

  # find rows in the first data frame with at least one match
  idx1 <- which(apply(idx, 1, function(x) !all(is.na(x))))

  # find the corresponding rows in the second data frame (first matching
  # column wins); drop = FALSE keeps a matrix even if only one row matched
  idx2 <- apply(idx[idx1, , drop = FALSE], 1,
                function(x) x[min(which(!is.na(x)))])

  # copy the non-key columns of the second data frame onto the first,
  # only for the rows matched above, and return the result explicitly
  cbind(df1[idx1, ], df2[idx2, !(names(df2) %in% col), drop = FALSE])
}


# assemble the final output
df_final <- rbind(simplify(df_1, df_2),  # find df_1 rows with matches in df_2
                  simplify(df_2, df_1))  # and vice versa
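
To make the matching step more concrete, here is a small sketch of the position matrix that the sapply/match call produces for the sample data. The data frame definitions repeat the tables from the question with assumed column types, and keys is just an illustrative name.

```r
# Sample data from the question (column types are assumed).
df_1 <- data.frame(id = 1:4,
                   permno = c(1, NA, 2, 5),
                   cusip  = c(11, 12, 13, NA),
                   ticker = c("AA", NA, NA, NA),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(permno = c(1, 7, 2),
                   cusip  = c(11, 15, NA),
                   ticker = c(NA, "BX", "CU"),
                   return = c(100, 102, 103),
                   date   = c("date_1", "date_2", "date_3"),
                   stringsAsFactors = FALSE)

keys <- c("permno", "cusip", "ticker")

# One column per key; each entry is the df_2 row that a df_1 row matches
# on that key, or NA when there is no match (NA never matches NA here).
idx <- sapply(keys, function(k) match(df_1[[k]], df_2[[k]], incomparables = NA))
print(idx)

# df_1 row 1 matches df_2 row 1 on permno and on cusip;
# df_1 row 3 matches df_2 row 3 on permno; rows 2 and 4 match nothing.
rows_with_match <- which(rowSums(!is.na(idx)) > 0)
```

Rows whose entries are all NA are the ones that simplify() drops.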

Final output (if you would like it sorted by id)

> df_final[order(df_final$id), ]
   id permno cusip ticker return   date
1   1      1    11     AA    100 date_1
11  1      1    11   <NA>    100 date_1
3   3      2    13   <NA>    103 date_3
31  3      2    NA     CU    103 date_3
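
One caveat: match() returns only the first matching position, so if df_2 holds several rows per stock (for example one per date, which the question does not say), the function above keeps just one of them. A possible alternative, sketched under that assumption, is to run one merge() per key column, stack the pieces, and drop pairs found through more than one key. merge_any_key, .row_a and .row_b are hypothetical names used only for this illustration. Unlike the output above, this sketch keeps df_1's key values only, and it assumes the two data frames share no column names other than the keys.

```r
# Sample data from the question (column types are assumed).
df_1 <- data.frame(id = 1:4,
                   permno = c(1, NA, 2, 5),
                   cusip  = c(11, 12, 13, NA),
                   ticker = c("AA", NA, NA, NA),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(permno = c(1, 7, 2),
                   cusip  = c(11, 15, NA),
                   ticker = c(NA, "BX", "CU"),
                   return = c(100, 102, 103),
                   date   = c("date_1", "date_2", "date_3"),
                   stringsAsFactors = FALSE)

keys <- c("permno", "cusip", "ticker")

# Hypothetical helper: keep every (row of a, row of b) pair that agrees
# on at least one key column, with keys taken from a and the rest from b.
merge_any_key <- function(a, b, keys) {
  a$.row_a <- seq_len(nrow(a))           # remember original row numbers
  b$.row_b <- seq_len(nrow(b))
  b_extra  <- setdiff(names(b), keys)    # non-key columns of b

  pieces <- lapply(keys, function(k) {
    # drop NA keys first, because merge() would otherwise match NA to NA
    a_k <- a[!is.na(a[[k]]), , drop = FALSE]
    b_k <- b[!is.na(b[[k]]), c(k, b_extra), drop = FALSE]
    merge(a_k, b_k, by = k)
  })

  out <- do.call(rbind, pieces)          # rbind aligns columns by name
  # a pair found through two different keys should appear only once
  out <- out[!duplicated(out[c(".row_a", ".row_b")]), , drop = FALSE]
  out <- out[order(out$.row_a, out$.row_b), , drop = FALSE]
  rownames(out) <- NULL
  out[setdiff(c(names(a), b_extra), c(".row_a", ".row_b"))]
}

res <- merge_any_key(df_1, df_2, keys)
print(res)
```

On the sample data this gives one row for id 1 and one for id 3. Whether it is faster than the match-based version depends on the data, so it is worth timing both on a realistic subset.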
