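How do I join data frames on multiple columns with a fuzzy match on one of them?

Time: 2019-12-17 16:19:46

Tags: r dplyr fuzzyjoin

I am trying to combine decoded VIN data from NHTSA with vehicle data from fueleconomy.gov, using year, make, and model. Here is a sample of the data I want to join: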

d <- structure(list(Weather = c("Snow Low clouds", "Snow Cloudy", 
"Drizzle Fog", "Thundershowers Partly cloudy", "Thunderstorms More clouds than sun", 
"Sprinkles Partly cloudy", "Heavy rain Broken clouds", "Light rain Partly cloudy", 
"Rain showers Passing clouds", "Thundershowers Scattered clouds", 
"Thundershowers Passing clouds", "Light snow Overcast", "Snow Light fog", 
"Drizzle Broken clouds", "Light rain Fog", "Cloudy", "Thunderstorms Partly cloudy", 
"Heavy rain More clouds than sun", "Partly cloudy", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -20L))

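I am running into several problems while trying to complete this join.

  1. I need to join the data on make and year, but the match on make has to be case-insensitive.
  2. I need an inexact match on model, and probably need to average the mpg values per model, because the fueleconomy.gov data has several entries for each model (e.g. 2WD, 4WD, different engine sizes, hybrids, etc.).

I have consulted the following questions while trying to solve this puzzle:

I have also contacted fueleconomy.gov and NHTSA to ask whether they offer a way to join the data on a vehicle ID, but I wanted to ask the community in case there is also a simple solution.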

2 answers:

Answer 0: (score: 1)

Your reprex has a few typos, so I have pasted it again below.

# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- c("PHEV", "BEV", "PHEV")
vin_data <- data.frame(make, model, year, electrification_level, stringsAsFactors = FALSE)

# This is the second dataframe    
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg, stringsAsFactors = FALSE) 

For your first problem, I would use the toupper function to convert the makes to uppercase and then use a full join.

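library(dplyr)

# vin_data$make is already uppercase, so only the fueleconomy.gov makes need converting
df_joined <- vin_data %>% 
  full_join(fueleconomy_data %>% 
    dplyr::mutate(make = base::toupper(make)), by = "make")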
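
The question asks for a join on both make and year, whereas the join above uses make alone. A minimal sketch of a two-key version, assuming the sample data from this answer; the helper column make_key and the suffixes _vin and _fe are names chosen here for illustration, not part of either data source:

```r
library(dplyr)

# Sample data copied from the reprex in this answer (mpg values are made up)
vin_data <- data.frame(
  make = c("PORSCHE", "TESLA", "MITSUBISHI"),
  model = c("Cayenne", "Model S", "Outlander - PHEV"),
  year = c(2017, 2013, 2018),
  electrification_level = c("PHEV", "BEV", "PHEV"),
  stringsAsFactors = FALSE
)
fueleconomy_data <- data.frame(
  make = c("Porsche", "Tesla", "Mitsubishi"),
  model = c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD"),
  year = c(2017, 2013, 2018),
  mpg = c(75, 120, 80),
  stringsAsFactors = FALSE
)

# Build an uppercase key on both sides so the original make spelling is kept,
# then join on the key together with year
joined_by_make_year <- vin_data %>%
  mutate(make_key = toupper(make)) %>%
  inner_join(
    fueleconomy_data %>% mutate(make_key = toupper(make)),
    by = c("make_key", "year"),
    suffix = c("_vin", "_fe")
  )

print(joined_by_make_year)
```

Joining on a separate key column rather than overwriting make keeps both spellings available for later inspection; an inner join is used here so that vehicles without a counterpart are dropped instead of being padded with NA.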

For #2, you can use some if/else logic. Here is my attempt; adjust it to your needs.

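# word() comes from stringr; with no further arguments it returns the first word of each string
library(stringr)
df_joined %>%  
  dplyr::mutate(model_same = if_else(condition = word(model.x) == word(model.y), true = TRUE, false = FALSE))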
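
To get from the first-word comparison to one average mpg per vehicle, one option is to keep only the rows whose fueleconomy.gov model starts with the first word of the VIN model and then summarise. The sketch below is only an illustration under assumptions: the "Outlander 4WD" and "Mirage" rows and all mpg values are invented, and fe_model, make_key, avg_mpg and n_variants are names chosen here.

```r
library(dplyr)

# VIN side: two vehicles from the reprex
vin_data <- data.frame(
  make = c("TESLA", "MITSUBISHI"),
  model = c("Model S", "Outlander - PHEV"),
  year = c(2013, 2018),
  stringsAsFactors = FALSE
)

# fueleconomy.gov side: hypothetical rows with several variants per make and year
fueleconomy_data <- data.frame(
  make = c("Tesla", "Mitsubishi", "Mitsubishi", "Mitsubishi"),
  model = c("Model S AWD - P85D", "Outlander 2WD", "Outlander 4WD", "Mirage"),
  year = c(2013, 2018, 2018, 2018),
  mpg = c(120, 80, 70, 39),
  stringsAsFactors = FALSE
)

avg_mpg <- vin_data %>%
  mutate(make_key = toupper(make)) %>%
  inner_join(
    fueleconomy_data %>%
      mutate(make_key = toupper(make)) %>%
      select(make_key, year, fe_model = model, mpg),
    by = c("make_key", "year")
  ) %>%
  # Keep variants whose name starts with the first word of the VIN model
  filter(startsWith(toupper(fe_model), toupper(sub(" .*$", "", model)))) %>%
  group_by(make, model, year) %>%
  summarise(avg_mpg = mean(mpg), n_variants = n(), .groups = "drop")

print(avg_mpg)
```

The first-word rule is crude: "Model S" reduces to "Model", so a "Model X" entry from the same make and year would also pass the filter. A stricter rule or a string-distance threshold would be needed for real data.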

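Answer 1: (score: 0)

You can use the RecordLinkage package to get what you want. Raise or lower the 0.6 threshold passed to epiClassify to make the text matching stricter or looser.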

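library(RecordLinkage)
# Block on column 1 (make), apply a string comparator to column 2 (model), ignore columns 3 and 4
pairs <- compare.linkage(fueleconomy_data, vin_data, strcmp = 2, exclude = c(3, 4), blockfld = 1)
epiwt <- epiWeights(pairs)
# Pairs with a weight of at least 0.6 are classified as links
epiclass <- epiClassify(epiwt, 0.6)
getPairs(epiclass, show = "links", single.rows = TRUE)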

      make.1            model.1     make.2          model.2 year.2 electrification_level.2
3 MITSUBISHI      Outlander 2WD MITSUBISHI Outlander - PHEV   2018                    PHEV
1    PORSCHE Cayenne S e-Hybrid    PORSCHE          Cayenne   2017                    PHEV
2      TESLA Model S AWD - P85D      TESLA          Model S   2013                     BEV

This converts the make columns to uppercase. You need to run it before the record linkage:
vin_data$make <- toupper(vin_data$make)
fueleconomy_data$make <- toupper(fueleconomy_data$make)
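
The question is also tagged fuzzyjoin. As a rough sketch, assuming the fuzzyjoin package is installed, fuzzy_inner_join accepts a list with one matching function per column named in by, so make can be compared case-insensitively, year exactly, and model loosely in a single call. The prefix rule used for model and the helper first_word are assumptions made for this example; they stand in for a proper string-distance measure.

```r
library(dplyr)
library(fuzzyjoin)

# Sample data from the reprex above (mpg values are made up)
vin_data <- data.frame(
  make = c("PORSCHE", "TESLA", "MITSUBISHI"),
  model = c("Cayenne", "Model S", "Outlander - PHEV"),
  year = c(2017, 2013, 2018),
  electrification_level = c("PHEV", "BEV", "PHEV"),
  stringsAsFactors = FALSE
)
fueleconomy_data <- data.frame(
  make = c("Porsche", "Tesla", "Mitsubishi"),
  model = c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD"),
  year = c(2017, 2013, 2018),
  mpg = c(75, 120, 80),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first word of a string, uppercased
first_word <- function(s) toupper(sub(" .*$", "", s))

fuzzy_matches <- fuzzy_inner_join(
  vin_data, fueleconomy_data,
  by = c("make", "year", "model"),
  match_fun = list(
    function(x, y) toupper(x) == toupper(y),              # make: ignore case
    function(x, y) x == y,                                # year: exact
    function(x, y) startsWith(toupper(y), first_word(x))  # model: prefix match
  )
)

print(fuzzy_matches)
```

Columns present in both inputs come back with .x and .y suffixes, so the result can be grouped by make.x, model.x and year.x and summarised with mean(mpg) when several variants match one vehicle.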