Is it possible to merge based on multiple conditions?

Time: 2019-11-28 14:03:17

Tags: r dataframe merge

What I want to achieve is to compare the data by date and, when the date falls within a range, take the lowest "PDF2" value.

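Here is an example of the two data frames I am working with. For each value in the "R" column of "df" that is also found in the "R" column of "df2", I want to check whether the date falls within the range given in df2. If there are any conflicts or duplicates, I always want to keep the minimum "PDF2" value.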

df <- data.frame("D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/12/2019"),
             "R" = c("ABC123", "ABC123", "ABC123", "ABC1"),
             "PDF" = c(1.23, 1.23, 1.23, 1.23),
             stringsAsFactors = FALSE)

df2 <- data.frame("DD" = c("01/01/2019", "01/02/2019", "01/01/2019"),
              "DF" = c("01/02/2019", "01/03/2019", "01/11/2019"),
              "R" = c("ABC123", "ABC123", "ABC1"),
              "PDF2" = c(1.12, 1.11, 1.12),
              stringsAsFactors = FALSE)

This is the result I expect.

result <- data.frame("R" = c("ABC123", "ABC123", "ABC123"),
                 "D" = c("01/01/2019", "01/02/2019", "01/03/2019"),
                 "DD" = c("01/01/2019", "01/02/2019", "01/02/2019"),
                 "DF" = c("01/02/2019", "01/03/2019", "01/03/2019"),
                 "PDF" = c(1.23, 1.23, 1.23),
                 "PDF2" = c(1.12, 1.11, 1.11),
                 stringsAsFactors = FALSE)

You can see that "ABC1" is not in the result, because its date is not within the range.

My current problem is keeping only the minimum value when date ranges are duplicated or overlap.

Here is an example of my current code:

temp <- merge(df, df2, by = "R")
myd <- which(as.Date(temp$D, format = "%d/%m/%Y") <= as.Date(temp$DF, format = "%d/%m/%Y"))
myd2 <- which(as.Date(temp$D, format = "%d/%m/%Y") >= as.Date(temp$DD, format = "%d/%m/%Y"))
myd <- myd[myd %in% myd2]
if (length(myd)) {
  temp <- temp[myd,]
}

Also, how can I get the rows that do not meet the requirements in a separate data frame?

2 Answers:

Answer 0: (score: 1)

I think the answers to this question may help you:

How to find matches for a row in a dataframe conditional on many rows from another dataframe

Answer 1: (score: 0)

If you need an efficient tool, you can use the data.table package. The following code does what you ask:

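library(data.table)

# Convert both data frames to data.tables keyed on R (the join column)
setDT(df, key="R")
setDT(df2, key="R")

# Parse the day/month/year strings into Date objects
df[, D:=as.Date(D, format = "%d/%m/%Y")]
df2[, `:=`(
  DD = as.Date(DD, format = "%d/%m/%Y"),
  DF = as.Date(DF, format = "%d/%m/%Y")
)]

# Join on R, keep rows whose D lies within [DD, DF], then for each (D, R, PDF)
# keep the range with the latest start date DD; on this sample data that is
# also the row with the smallest PDF2
df[df2][D>=DD & D<=DF][, .(DD=max(DD), DF=max(DF), PDF2=PDF2[which.max(DD)]), .(D, R, PDF)]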
##              D      R  PDF         DD         DF PDF2
##  1: 2019-01-01 ABC123 1.23 2019-01-01 2019-02-01 1.12
##  2: 2019-02-01 ABC123 1.23 2019-02-01 2019-03-01 1.11
##  3: 2019-03-01 ABC123 1.23 2019-02-01 2019-03-01 1.11
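
The question also asks to keep the minimum "PDF2" explicitly when ranges overlap, and to collect the rows that do not match in a separate data frame. A minimal base R sketch of one way this might be done, starting from the original character-date data frames in the question. The `id` column, the `to_date` helper, and the `cand`, `matched` and `unmatched` names are assumptions introduced here for illustration; they are not part of the question or of either answer.

```r
# Example data copied from the question
df <- data.frame(D = c("01/01/2019", "01/02/2019", "01/03/2019", "01/12/2019"),
                 R = c("ABC123", "ABC123", "ABC123", "ABC1"),
                 PDF = c(1.23, 1.23, 1.23, 1.23),
                 stringsAsFactors = FALSE)
df2 <- data.frame(DD = c("01/01/2019", "01/02/2019", "01/01/2019"),
                  DF = c("01/02/2019", "01/03/2019", "01/11/2019"),
                  R = c("ABC123", "ABC123", "ABC1"),
                  PDF2 = c(1.12, 1.11, 1.12),
                  stringsAsFactors = FALSE)

# Hypothetical helper: parse day/month/year strings
to_date <- function(x) as.Date(x, format = "%d/%m/%Y")

# Hypothetical row id, so each original row of df can be tracked after merging
df$id <- seq_len(nrow(df))

# All pairs that share R, then keep only pairs where D lies within [DD, DF]
cand <- merge(df, df2, by = "R")
in_range <- to_date(cand$D) >= to_date(cand$DD) &
            to_date(cand$D) <= to_date(cand$DF)
cand <- cand[in_range, ]

# Within each original df row, put the smallest PDF2 first and keep that pair
cand <- cand[order(cand$id, cand$PDF2), ]
matched <- cand[!duplicated(cand$id), c("R", "D", "DD", "DF", "PDF", "PDF2")]
rownames(matched) <- NULL

# Rows of df that have no in-range match in df2
unmatched <- df[!df$id %in% cand$id, c("D", "R", "PDF")]
```

On the sample data, `matched` holds the three "ABC123" rows with PDF2 values 1.12, 1.11 and 1.11, and `unmatched` holds the single "ABC1" row. Filtering with a logical vector rather than `which()` plus `if (length(...))` also means that an empty match leaves `cand` empty instead of unfiltered.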