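I have administrative data that was entered by multiple people, so many of the entries are mistyped, incorrectly formatted, or misspelled. For example, every Ford should be listed as 'Ford', but instead I have entries such as 'Ford', 'Ford Taurus', 'Ford f150', '1980 Capri Classic', and so on.
I am trying to create a new column that lists every vehicle make in a single format (for example, all of the Ford entries above would appear as 'Ford'). I have searched for an answer, but nothing seems to work.
For example (EquipMake is the column containing the raw data, and New_Make is the column I want to create):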
| EquipMake | New_Make |
| --- | --- |
| 1980 Capri Classic | Ford |
| Camry | Toyota |
| NISON | Nissan |
| ford | Ford |
| Mitsubishi Eclipse Con | Mitsubishi |
| Cadilac Seville | Cadillac |
| Dodge Caravan | Dodge |
| 1987 Ford | Ford |
| Honda Accord | Honda |
| poss / pontiac | Unknown |
| Oldsmobile Cutless Cie | Oldsmobile |
| bmw | BMW |
The code below is the closest I have come, but it only works for some entries, and I cannot figure out why:
mydata[grep("?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted", mydata$EquipMake), "New_Make"] <- "Unknown"
mydata[grep("021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER", mydata$EquipMake), "New_Make"] <-"Other"
mydata[grep("1979 DRUMMOND", mydata$EquipMake), "New_Make"] <- "Drummond"
mydata[grep("APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD AEROSTAR|FORD ?|FORD 150 XLT LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO", mydata$EquipMake), "New_Make"] <- "Ford"
mydata[grep("1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA ACCORD|HONDA (CIVIC?)|HONDA ACCORD|HONDA CIVIC|HONDA CIVIC 4 DR. 1|HONDA PRELUDE|HONDA SUV|HONDA?", mydata$EquipMake), "New_Make"] <- "Honda"
mydata[grep("1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat", mydata$EquipMake), "New_Make"] <- "Volkswagon"
mydata[grep("1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero", mydata$EquipMake), "New_Make"] <- "Buick"
mydata[grep("1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC Jimmy|GMC Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4 stolen bc 77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard", mydata$EquipMake), "New_Make"] <- "GMC"
mydata[grep("2005 audi a4 1.8 l|audi", mydata$EquipMake), "New_Make"] <- "Audi"
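(These are only the first few lines of code; there are 90 lines in total.)
When I look at the output, the new column should contain 90 different makes, but only 21 of them work (the rest come up as "Unknown"). Of the code above, Drummond, Audi, and Buick did not work.
Can anyone tell me why this isn't working, or point me toward an approach that would?
I am fairly new to R, so the simpler the explanation the better :)
Thanks!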
Answer 0 (score: 0)
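You should think about the format your data needs to be in. It looks like you are going through each input value and writing down the make it corresponds to, and you want to find every occurrence of each EquipMake value and assign it the appropriate New_Make value. As pointed out in the comments, there are other ways to approach this problem. If you take this approach, though, there is a much easier way than trying to `grep` each value. Create a tidy dataset with two columns (EquipMake and New_Make) and one row for each EquipMake value you want to recode. Then join that dataset to your main data with the `left_join` function (from the `dplyr` package, which is part of the `tidyverse`).
library(tidyverse) # Should be part of all data science workflows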
###############################
# Generate data
grep_data <- c("?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted",
"021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER",
"1979 DRUMMOND",
"APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD AEROSTAR|FORD ?|FORD 150 XLT LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO",
"1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA ACCORD|HONDA (CIVIC?)|HONDA ACCORD|HONDA CIVIC|HONDA CIVIC 4 DR. 1|HONDA PRELUDE|HONDA SUV|HONDA?",
"1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat",
"1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero",
"1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC Jimmy|GMC Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4 stolen bc 77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard",
"2005 audi a4 1.8 l|audi")
make_data <- c("Unknown", "Other", "Drummond", "Ford", "Honda", "Volkswagen", "Buick", "GMC", "Audi")
raw_reference <- tibble(grep_data, make_data)
# Split one row of raw_reference (pattern string, make) into one row per raw value
make_replacement_table <- function(namestring) {
  strsplit(namestring[1], split = "|", fixed = TRUE) %>%
    unlist() %>%
    tibble(., namestring[2]) %>%
    set_names(c("EquipMake", "New_Make"))
}
# Create the lookup table containing original and replacement values
# You could create the table in Excel and import with readr::read_csv()
# distinct() drops repeated aliases (e.g. "HONDA ACCORD" is listed twice),
# which would otherwise duplicate rows in the join
reference_table <- apply(raw_reference, 1, make_replacement_table) %>%
  do.call(rbind.data.frame, .) %>%
  distinct()
# Simulate raw data by sampling EquipMake values from the lookup table
# (a stand-in for your real mydata; reference_table must exist first)
mydata <- tibble(
  EquipMake = sample(reference_table$EquipMake, size = 1000, replace = TRUE)
)
###############################
# The answer to your question
# Now join reference_table against your raw data
# Any values of EquipMake you haven't coded will be NA
mydata <- mydata %>%
  left_join(reference_table, by = "EquipMake")
The `left_join` call attaches the matching New_Make value from the lookup table to each row of the main data.
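One thing to watch: an exact-match join is case-sensitive, just as `grep()` is by default. That is a likely reason patterns such as `audi` or `1979 DRUMMOND` miss entries typed in a different case. Also, `?` is a regex quantifier rather than a literal question mark unless it is escaped or `fixed = TRUE` is used. The sketch below is only a minimal illustration under those assumptions: the sample values, the `lookup` table, and the helper column `key` are made up for the example and are not taken from your data.

```r
library(dplyr)

# Hypothetical raw values with mixed case, stray whitespace, and one uncoded make
raw <- tibble(
  EquipMake = c("ford", "Audi ", "BUICK", "1979 Drummond", "???", "Camry")
)

# grepl() is case-sensitive by default, so "audi" does not match "Audi "
grepl("audi", raw$EquipMake)
grepl("audi", raw$EquipMake, ignore.case = TRUE)

# fixed = TRUE treats "???" as literal text instead of regex quantifiers
grepl("???", raw$EquipMake, fixed = TRUE)

# Hypothetical lookup table keyed on upper-case, trimmed values
lookup <- tibble(
  key      = c("FORD", "AUDI", "BUICK", "1979 DRUMMOND", "???"),
  New_Make = c("Ford", "Audi", "Buick", "Drummond", "Unknown")
)

# Normalise the raw values the same way before joining
recoded <- raw %>%
  mutate(key = toupper(trimws(EquipMake))) %>%
  left_join(lookup, by = "key") %>%
  select(-key)

# Raw values that still need a row in the lookup table
unmatched <- recoded %>%
  filter(is.na(New_Make)) %>%
  distinct(EquipMake)
```

Checking `unmatched` after each run gives you a list of raw values that still need to be added to the lookup table, which is usually easier to maintain than 90 lines of patterns.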