Join data frames on a union of keys

Time: 2016-09-06 17:04:59

Tags: r dplyr

My datasets L and R each contain two possible identifiers, ticker and cusip, and neither identifier is complete:

library(dplyr)
set.seed(123)
L = data.frame(ticker = c(1, 2, 3, NA, NA, 6, 7, NA),
               cusip = c(NA, NA, NA, 4, 5, 6, 7, NA),
               x = runif(8))
R = data.frame(ticker = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
               cusip = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
               y = runif(10))

> L
  ticker cusip         x
1      1    NA 0.2875775
2      2    NA 0.7883051
3      3    NA 0.4089769
4     NA     4 0.8830174
5     NA     5 0.9404673
6      6     6 0.0455565
7      7     7 0.5281055
8     NA    NA 0.8924190

> R
   ticker cusip          y
1       1     1 0.55143501
2       2     2 0.45661474
3       3     3 0.95683335
4       4     4 0.45333416
5       5     5 0.67757064
6       6     6 0.57263340
7       6     6 0.10292468
8       9     9 0.89982497
9       9     9 0.24608773
10     10    10 0.04205953

I want a left join of L and R in which a row matches when both keys, or either one of them, match.

Is there a better way to achieve this than manually doing first

left_join(L, R, by="ticker")

then

left_join(L, R, by="cusip")

and then combining the results and renaming all the columns?
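To make the renaming problem concrete, here is a small sketch of what the two separate joins return on the sample data. The names by_ticker and by_cusip are only illustrative. Because the key that is not joined on exists in both tables, dplyr disambiguates it with the default .x and .y suffixes, which is why the two results cannot be stacked without renaming.

```r
library(dplyr)

set.seed(123)
L <- data.frame(ticker = c(1, 2, 3, NA, NA, 6, 7, NA),
                cusip  = c(NA, NA, NA, 4, 5, 6, 7, NA),
                x = runif(8))
R <- data.frame(ticker = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
                cusip  = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
                y = runif(10))

# by_ticker and by_cusip are illustrative names, not part of the question
by_ticker <- left_join(L, R, by = "ticker")
by_cusip  <- left_join(L, R, by = "cusip")

names(by_ticker)  # "ticker" "cusip.x" "x" "cusip.y" "y"
names(by_cusip)   # "ticker.x" "cusip" "x" "ticker.y" "y"
```

The column sets differ between the two results, so any combination step has to reconcile cusip.x/cusip.y and ticker.x/ticker.y first.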

The output should be

> result
  ticker   cusip         x            y
1      1      NA 0.2875775       0.5514350
2      2      NA 0.7883051       0.4566147
3      3      NA 0.4089769       0.9568333
4     NA       4 0.8830174       0.4533342
5     NA       5 0.9404673       0.6775706
6      6       6 0.0455565       0.5726334
7      6       6 0.0455565       0.1029247
8      7       7 0.5281055              NA
9     NA      NA 0.8924190              NA

Edit (clarification): the identifiers ticker and cusip are not necessarily identical.

1 Answer:

Answer 0 (score: 1):

library(tidyr)  # gather() comes from tidyr, not dplyr

L %>%
  left_join(select(R, ticker, y), by = "ticker") %>%
  left_join(select(R, cusip, y), by = "cusip") %>%
  # we now have x, y.x, and y.y
  gather(ign, y, y.x:y.y) %>%
  select(-ign) %>%
  # drop exact duplicates produced by rows that matched on both keys
  filter(! duplicated(.)) %>%
  group_by(ticker, cusip, x) %>%
  # keep an NA y only when it is the sole row for that L record
  filter(n() == 1 | ! is.na(y))
# # A tibble: 9 x 4
#   ticker cusip         x         y
#    <dbl> <dbl>     <dbl>     <dbl>
# 1      1    NA 0.2875775 0.5514350
# 2      2    NA 0.7883051 0.4566147
# 3      3    NA 0.4089769 0.9568333
# 4      6     6 0.0455565 0.5726334
# 5      6     6 0.0455565 0.1029247
# 6      7     7 0.5281055        NA
# 7     NA    NA 0.8924190        NA
# 8     NA     4 0.8830174 0.4533342
# 9     NA     5 0.9404673 0.6775706

Because the join rule is a bit ... loose(?), I'm not sure there is a more elegant way. I hope the SO dplyr gurus prove me wrong :-)
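For comparison, one possible variation is to tag each row of L and R with an id, collect the (L row, R row) pairs that match on either key, and join back. This is only a sketch under assumptions: the helper columns l_id and r_id and the intermediate names are made up for illustration, and it de-duplicates matches by row identity rather than by value, so it is not equivalent to the pipeline above when R holds distinct rows with identical y.

```r
library(dplyr)

set.seed(123)
L <- data.frame(ticker = c(1, 2, 3, NA, NA, 6, 7, NA),
                cusip  = c(NA, NA, NA, 4, 5, 6, 7, NA),
                x = runif(8))
R <- data.frame(ticker = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
                cusip  = c(1, 2, 3, 4, 5, 6, 6, 9, 9, 10),
                y = runif(10))

# Hypothetical helper ids so that a match is identified by row, not by value
L_id <- mutate(L, l_id = row_number())
R_id <- mutate(R, r_id = row_number())

# Pairs matching on ticker; NA keys are excluded so NA never matches NA
by_ticker <- L_id %>%
  filter(!is.na(ticker)) %>%
  select(l_id, ticker) %>%
  inner_join(select(R_id, r_id, ticker), by = "ticker")

# Pairs matching on cusip
by_cusip <- L_id %>%
  filter(!is.na(cusip)) %>%
  select(l_id, cusip) %>%
  inner_join(select(R_id, r_id, cusip), by = "cusip")

# Union of both match sets; a pair found via both keys is kept once
matches <- bind_rows(select(by_ticker, l_id, r_id),
                     select(by_cusip, l_id, r_id)) %>%
  distinct()

# Left join back so unmatched L rows survive with y = NA
result <- L_id %>%
  left_join(matches, by = "l_id") %>%
  left_join(select(R_id, r_id, y), by = "r_id") %>%
  select(ticker, cusip, x, y)
```

On the sample data this yields nine rows in the original order of L, with y missing only for the rows that match on neither key.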