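I am trying to join decoded VIN data from NHTSA with vehicle data from fueleconomy.gov, using year, make, and model as the keys. Here is an example of the data I am trying to join: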
d <- structure(list(Weather = c("Snow Low clouds", "Snow Cloudy",
"Drizzle Fog", "Thundershowers Partly cloudy", "Thunderstorms More clouds than sun",
"Sprinkles Partly cloudy", "Heavy rain Broken clouds", "Light rain Partly cloudy",
"Rain showers Passing clouds", "Thundershowers Scattered clouds",
"Thundershowers Passing clouds", "Light snow Overcast", "Snow Light fog",
"Drizzle Broken clouds", "Light rain Fog", "Cloudy", "Thunderstorms Partly cloudy",
"Heavy rain More clouds than sun", "Partly cloudy", NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
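I have run into several problems while trying to complete this join.
I have consulted the following questions while trying to solve this puzzle:
I have also contacted fueleconomy.gov and NHTSA to ask whether they offer a way to join the data on a vehicle ID, but I wanted to ask the community whether there might also be a simple solution.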
Answer 0 (score: 1)
Your reprex had a few typos, so I have pasted a corrected version below.
# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- c("PHEV", "BEV", "PHEV")
vin_data <- data.frame(make, model, year, electrification_level, stringsAsFactors = FALSE)
# This is the second dataframe
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg, stringsAsFactors = FALSE)
For your first problem, I would use the toupper function to convert the make values to uppercase and then do a full join.
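library(dplyr)  # provides %>%, full_join() and mutate()

# Upper-case make in fueleconomy_data so it matches vin_data, then full join on make
df_joined <- vin_data %>%
  full_join(fueleconomy_data %>%
              dplyr::mutate(make = base::toupper(make)), by = "make")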
For #2, you can use some if/else logic. I gave it a try below, but you can adjust it to your liking.
library(stringr)  # provides word(), which returns the first word by default

# Flag rows where the first word of the two model names is the same
df_joined %>%
  dplyr::mutate(model_same = word(model.x) == word(model.y))
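The two steps above can be combined, and the matched rows collapsed to one average mpg per VIN record, as the comment in the reprex suggests. This is only a minimal sketch under a few assumptions: it joins on year as well as make, it treats "same first word of the model name" as a match, and the names matched, model_vin, model_fe and avg_mpg are made up for the illustration.

```r
library(dplyr)
library(stringr)

# Toy data copied from the reprex above (the mpg values are made up)
vin_data <- data.frame(
  make = c("PORSCHE", "TESLA", "MITSUBISHI"),
  model = c("Cayenne", "Model S", "Outlander - PHEV"),
  year = c(2017, 2013, 2018),
  electrification_level = c("PHEV", "BEV", "PHEV"),
  stringsAsFactors = FALSE
)
fueleconomy_data <- data.frame(
  make = c("Porsche", "Tesla", "Mitsubishi"),
  model = c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD"),
  year = c(2017, 2013, 2018),
  mpg = c(75, 120, 80),
  stringsAsFactors = FALSE
)

matched <- vin_data %>%
  mutate(make = toupper(make)) %>%
  # Join on make and year; the clashing model columns get the suffixes below
  inner_join(
    fueleconomy_data %>% mutate(make = toupper(make)),
    by = c("make", "year"),
    suffix = c("_vin", "_fe")
  ) %>%
  # Assumption: two model names match when their first words are equal
  filter(word(model_vin) == word(model_fe)) %>%
  # Average mpg over every fueleconomy.gov variant matched to a VIN record
  group_by(make, model_vin, year, electrification_level) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

print(matched)
```

The first-word rule is crude: "Model S" and "Model X" would both reduce to "Model", so real data will probably need a stricter comparison for some makes.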
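Answer 1 (score: 0)
You can use the RecordLinkage package to get what you want. Raising or lowering the 0.6 threshold passed to epiClassify makes the text matching stricter or more lenient.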
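library(RecordLinkage)
# Block on make (column 1), compare model (column 2) as strings, ignore columns 3 and 4
pairs <- compare.linkage(fueleconomy_data, vin_data, strcmp = 2, exclude = c(3, 4), blockfld = 1)
# Compute a matching weight for each candidate pair
epiwt <- epiWeights(pairs)
# Classify pairs as links using a weight threshold of 0.6
epiclass <- epiClassify(epiwt, 0.6)
getPairs(epiclass, show = "links", single.rows = TRUE)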
      make.1            model.1     make.2          model.2 year.2 electrification_level.2
3 MITSUBISHI      Outlander 2WD MITSUBISHI Outlander - PHEV   2018                    PHEV
1    PORSCHE Cayenne S e-Hybrid    PORSCHE          Cayenne   2017                    PHEV
2      TESLA Model S AWD - P85D      TESLA          Model S   2013                     BEV
The code below converts the make column to uppercase. You need to run it before the record linkage step.
vin_data$make <- toupper(vin_data$make)
fueleconomy_data$make <- toupper(fueleconomy_data$make)
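To see how the threshold affects the result, you can look at the weights and count the links at a few cut-offs. This is a rough sketch, assuming the toy data from the first answer with make already upper-cased; the thresholds 0.4 and 0.8 and the name link_counts are arbitrary choices for the illustration.

```r
library(RecordLinkage)

# Toy data from the first answer, with make already upper-cased in both frames
vin_data <- data.frame(
  make = c("PORSCHE", "TESLA", "MITSUBISHI"),
  model = c("Cayenne", "Model S", "Outlander - PHEV"),
  year = c(2017, 2013, 2018),
  electrification_level = c("PHEV", "BEV", "PHEV"),
  stringsAsFactors = FALSE
)
fueleconomy_data <- data.frame(
  make = c("PORSCHE", "TESLA", "MITSUBISHI"),
  model = c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD"),
  year = c(2017, 2013, 2018),
  mpg = c(75, 120, 80),
  stringsAsFactors = FALSE
)

# Same comparison as above: block on make, string-compare model
pairs <- compare.linkage(fueleconomy_data, vin_data,
                         strcmp = 2, exclude = c(3, 4), blockfld = 1)
epiwt <- epiWeights(pairs)

# The weight of each candidate pair; the threshold is applied to these values
print(epiwt$Wdata)

# Count how many pairs are classified as links ("L") at each threshold
thresholds <- c(0.4, 0.6, 0.8)
link_counts <- sapply(thresholds, function(threshold) {
  sum(epiClassify(epiwt, threshold)$prediction == "L")
})
print(data.frame(threshold = thresholds, links = link_counts))
```

A higher threshold can only keep the same number of links or drop some, so comparing the printed weights with the counts is a quick way to choose a cut-off for your own data.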