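Cross-referencing two tables using the closest matching value

Time: 2019-01-20 23:15:41

Tags: r

I need to cross-reference two tables and create a new variable in the first one based on the second. The two tables are: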

> dput(df)
structure(list(PlayerName = "Example", DateOfBirth = structure(1069113600, class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), DateOfTest = structure(1476316800, class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Stature = 151.7, SittingHeight = 77, 
    BodyMass = 74, Age = 12.9034907597536, LegLength = 74.7, 
    year_from_phv = -0.993206850280964, AgeAtPHV = 13.8966976100346, 
    Maturation_stat = "Average"), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -1L))

> dput(reference)
structure(list(year_from_phv = c(-1, -0.8, -0.6, -0.4, -0.2, 
0, 0.2, 0.4, 0.6, 0.8, 1, -1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 
0.4, 0.6, 0.8, 1, -1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 
0.8, 1), Maturation_stat = c("Early", "Early", "Early", "Early", "Early", 
"Early", "Early", "Early", "Early", "Early", "Early", "Average", 
"Average", "Average", "Average", "Average", "Average", "Average", 
"Average", "Average", "Average", "Average", "Late", "Late", "Late", 
"Late", "Late", "Late", "Late", "Late", "Late", "Late", "Late"
), cm = c("27.66", "26.24", "24.68", "22.96", "21.07", "19.04", 
"16.96", "14.92", "13.01", "11.26", "9.6999999999999993", "24.36", 
"22.99", "21.51", "19.88", "18.09", "16.16", "14.21", "12.35", 
"10.65", "9.1199999999999992", "7.78", "20.22", "18.96", "17.68", 
"16.31", "14.76", "13.05", "11.32", "9.7100000000000009", "8.27", 
"6.94", "5.7")), row.names = c(NA, -33L), class = c("tbl_df", 
"tbl", "data.frame"))

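With these, I need to:

  1. Look at df$Maturation_stat and filter reference to the rows where reference$Maturation_stat is the same, then:
    1. Look at df$year_from_phv and find the closest matching value in reference$year_from_phv
  2. Based on those two filters, return the value of reference$cm and store it as a variable in df. For the example data in df, this should return 24.36

If possible, could this also be wrapped in a function?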

2 answers:

Answer 0: (score: 2)

As a first attempt, you can loop over each row of df and implement the logic to find the matching row of reference:

# create the extra column of df
df$cm <- NA
for (i in seq_len(nrow(df))) {
    # find rows in reference with the same Maturation_stat
    reference_ss <- reference[reference$Maturation_stat == df$Maturation_stat[i], ]

    # keep the row with the closest year_from_phv
    reference_ss <- reference_ss[which.min(abs(df$year_from_phv[i] - reference_ss$year_from_phv)), ]

    # extract the cm and store it (cm is stored as character in reference)
    df$cm[i] <- reference_ss$cm[1]
}

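Caveat: this assumes a matching row can always be found, and it stores the cm of only the first such row. You will have to handle the edge cases yourself, for example when several rows match equally well (which.min keeps only the first).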
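The caveat above can be made concrete with a small helper. This is only a sketch under stated assumptions: the function name closest_cm is hypothetical, the reference rows are a subset copied from the question, and the two extra players are made up for illustration. It returns NA when no Maturation_stat group matches, and converts cm to numeric.

```r
# Stand-in tables: reference values are a subset of the question's data,
# the "Hypothetical" players are invented for this example.
reference <- data.frame(
  year_from_phv   = rep(c(-1, -0.8, -0.6), times = 2),
  Maturation_stat = rep(c("Early", "Average"), each = 3),
  cm              = c("27.66", "26.24", "24.68", "24.36", "22.99", "21.51"),
  stringsAsFactors = FALSE
)
df <- data.frame(
  PlayerName      = c("Example", "Hypothetical A", "Hypothetical B"),
  year_from_phv   = c(-0.993206850280964, -0.72, -0.9),
  Maturation_stat = c("Average", "Early", "Unknown"),
  stringsAsFactors = FALSE
)

# Look up cm for one (stat, year) pair; NA when no group matches.
closest_cm <- function(stat, year, reference) {
  ref_ss <- reference[reference$Maturation_stat == stat, ]
  if (nrow(ref_ss) == 0) return(NA_real_)
  # which.min() returns the first minimum, so ties go to the earlier row
  nearest <- which.min(abs(ref_ss$year_from_phv - year))
  as.numeric(ref_ss$cm[nearest])
}

# Apply the helper to every row of df
df$cm <- mapply(closest_cm, df$Maturation_stat, df$year_from_phv,
                MoreArgs = list(reference = reference), USE.NAMES = FALSE)
print(df)
```

Returning NA for an unmatched group is one possible choice; stopping with an error may be preferable if a missing group indicates bad input.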


If you want to be fancy, you can use the data.table package to merge the data frames with a rolling join:

library(data.table)
# make dataframes to datatables
setDT(df)
setDT(reference)

# look up rows in reference matching rows in df
# join on the Maturation_stat and year_from_phv columns
# roll='nearest' means find the nearest year_from_phv if we can't match it
reference[df, on=.(Maturation_stat, year_from_phv), roll='nearest']
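The join above returns a new table rather than modifying df. A minimal sketch of attaching only cm back onto df might look like the following. It assumes each (Maturation_stat, year_from_phv) pair is unique in reference, so the join yields exactly one row per df row, in df order. The reference rows are a subset of the question's data, and the second player is hypothetical.

```r
library(data.table)

# Stand-in tables: reference values come from the question,
# "Hypothetical A" is invented for this example.
reference <- data.table(
  year_from_phv   = rep(c(-1, -0.8, -0.6), times = 2),
  Maturation_stat = rep(c("Early", "Average"), each = 3),
  cm              = c("27.66", "26.24", "24.68", "24.36", "22.99", "21.51")
)
df <- data.table(
  PlayerName      = c("Example", "Hypothetical A"),
  year_from_phv   = c(-0.993206850280964, -0.72),
  Maturation_stat = c("Average", "Early")
)

# Exact match on Maturation_stat, nearest match on year_from_phv
# (roll applies to the last column listed in `on`)
joined <- reference[df, on = .(Maturation_stat, year_from_phv), roll = "nearest"]

# cm is stored as character in reference, so convert before attaching
df[, cm := as.numeric(joined$cm)]
print(df)
```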

Answer 1: (score: 0)

Something like this?

add_cm <- function(df, reference) {
    # Filter for equal Maturation_stat
    filter1 <- reference[reference$Maturation_stat==df$Maturation_stat, ]
    # Calculate absolute difference of year_from_phv from reference and df 
    filter2 <- transform(filter1, diff=abs(year_from_phv-df$year_from_phv))
    # Add cm with minimum absolute difference
    df$cm <- filter2$cm[which.min(filter2$diff)]     
    df
}

add_cm(df, reference)

  PlayerName DateOfBirth DateOfTest Stature SittingHeight BodyMass      Age
1    Example  2003-11-18 2016-10-13   151.7            77       74 12.90349
  LegLength year_from_phv AgeAtPHV Maturation_stat    cm
1      74.7    -0.9932069  13.8967         Average 24.36
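Note that add_cm() compares df$Maturation_stat and df$year_from_phv as a whole, so it is only reliable when df has a single row, as in the example. One way to extend it to several rows, sketched here with a hypothetical wrapper add_cm_rows and an invented second player, is to call it once per row and stack the results. It assumes every row has at least one matching Maturation_stat in reference.

```r
# add_cm() as given in the answer above, unchanged
add_cm <- function(df, reference) {
    filter1 <- reference[reference$Maturation_stat==df$Maturation_stat, ]
    filter2 <- transform(filter1, diff=abs(year_from_phv-df$year_from_phv))
    df$cm <- filter2$cm[which.min(filter2$diff)]
    df
}

# Hypothetical wrapper: run add_cm() on each row, then stack the results
add_cm_rows <- function(df, reference) {
    rows <- lapply(seq_len(nrow(df)), function(i) {
        add_cm(df[i, , drop = FALSE], reference)
    })
    do.call(rbind, rows)
}

# Stand-in tables: reference values are a subset of the question's data,
# "Hypothetical A" is invented for this example.
reference <- data.frame(
    year_from_phv   = rep(c(-1, -0.8, -0.6), times = 2),
    Maturation_stat = rep(c("Early", "Average"), each = 3),
    cm              = c("27.66", "26.24", "24.68", "24.36", "22.99", "21.51"),
    stringsAsFactors = FALSE
)
df <- data.frame(
    PlayerName      = c("Example", "Hypothetical A"),
    year_from_phv   = c(-0.993206850280964, -0.72),
    Maturation_stat = c("Average", "Early"),
    stringsAsFactors = FALSE
)

out <- add_cm_rows(df, reference)
print(out)
```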