Creating a new column in R from the data in an existing column

Time: 2016-08-19 04:40:14

Tags: r

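I have administrative data that was entered by many different people, so many entries are mistyped, badly formatted, or misspelled. For example, every Ford should be listed as 'Ford', but instead I have entries such as 'Ford', 'Ford Taurus', 'Ford F150', '1980 Capri Classic', and so on.

I'm trying to create a new column that lists every car make in a single format (for example, all of the Ford entries above would appear as 'Ford'). I've searched for an answer, but nothing seems to work.

For example (EquipMake is the column containing the raw data, and New_Make is the column I want to create):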

**EquipMake**                 **New_Make**
1980 Capri Classic            Ford
Camry                         Toyota
NISON                         Nissan
ford                          Ford 
Mitsubishi Eclipse Con        Mitsubishi
Cadilac  Seville              Cadillac
Dodge Caravan                 Dodge
1987 Ford                     Ford
Honda Accord                  Honda
poss / pontiac                Unknown
Oldsmobile Cutless Cie        Oldsmobile
bmw                           BMW

The code below is the closest I've come, but it only works for some entries, and I can't figure out why...

mydata[grep("?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted", mydata$EquipMake), "New_Make"] <- "Unknown"
mydata[grep("021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS  CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER", mydata$EquipMake), "New_Make"] <-"Other"
mydata[grep("1979 DRUMMOND", mydata$EquipMake), "New_Make"] <- "Drummond"
mydata[grep("APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD  AEROSTAR|FORD ?|FORD 150 XLT   LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO", mydata$EquipMake), "New_Make"] <- "Ford"
mydata[grep("1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA   ACCORD|HONDA  (CIVIC?)|HONDA  ACCORD|HONDA  CIVIC|HONDA  CIVIC 4 DR.  1|HONDA PRELUDE|HONDA SUV|HONDA?", mydata$EquipMake), "New_Make"] <- "Honda"
mydata[grep("1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon  Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat", mydata$EquipMake), "New_Make"] <- "Volkswagon"
mydata[grep("1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero", mydata$EquipMake), "New_Make"] <- "Buick"
mydata[grep("1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC  Jimmy|GMC  Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4  stolen bc  77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard", mydata$EquipMake), "New_Make"] <- "GMC"
mydata[grep("2005 audi a4 1.8 l|audi", mydata$EquipMake), "New_Make"] <- "Audi"

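(These are only the first few lines of code; there are 90 lines in total.)

When I look at the output, there should be 90 different makes in the new column, but only 21 work (the rest come out as "Unknown"). In the code above, Drummond, Audi, and Buick all fail.

Can anyone tell me why this isn't working? Or point me in a direction that might work?

I'm fairly new to R, so the simpler the explanation, the better :)

Thanks!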

1 Answer:

Answer 0: (score: 0)

You should think about the format you want your data to be in. It looks like you are noting each input value and writing down the make it corresponds to, and that you want to look up every occurrence of EquipMake and assign it the appropriate New_Make value. As was pointed out in the comments, there are other ways to approach this problem. However, if you stay with this approach, there is a much easier way than trying to grep each value. Create a tidy dataset with two columns (EquipMake and New_Make) and one row for each EquipMake value you want to recode. Then join that dataset to your main data with the left_join function from the dplyr package, which is part of the tidyverse:

library(tidyverse) # Should be part of all data science workflows

###############################
# Generate data

# One pipe-delimited string of raw EquipMake values per make
grep_data <- c(
  "?|???|N/A|NA|NONE PROVIDED|NONE GIVEN|NOT PROVIDED|U/K|UNKNOWN|UNKNOWJN|UNLISTED|poss / pontiac|possible cutlass|possibly Honda|UNAVAILABLE|unlisted",
  "021CVG|1993|1B3BP44KLYN100171|20 FOOT|3 WHEELER|301 DORSEY|AREIE K CAR|BLACKWOOD HODGE|BLUE BIRD|BOLER|CAVCO|CLAYNOR TRAILER SALES|CMC|COMFORT|CRAFTSMAN|CUSTOM BUILT|DIESEL|FIFTH WHEEL|GARAGE TRUCK|GILLNETTER|GRUMAN|GRUHMANN|GULF STREAM|HONDAY|INTERNATIONAL|INTERNATIONAL - EAGLE|K-CAR|KING OF THE ROAD|KOBELCO|MCI|MIDAS CHATEAU|NATIONAL|OKANAGAN|ORCA|ORD|PHMAN|PICKUP|SCOOTER|SEDAN|TORO|TRAILER|UTILITY TRAILER|WABASH|WILDERNESS|AMER|AMER. MOTORS|DAMON|DAMON CORP|FREIGHT TRAILER",
  "1979 DRUMMOND",
  "APPEARED TO BE A FORD|CORSAIR|FORD|FORC F-150|FORD AEROSTAR|FORD ?|FORD 150 XLT LIGHT B|FORD E-350|FORD EXPLORER|FORD EXPLORER?|FORD F-150|FORD F-350|FORD F150|FORD F350|FORD MUSTANG|FORD MUSTANG 07|FORD PROBE|FORD TAURUS|FORD TAURUS LICENSE 0|FORD TEMPO|FORD THUNDERBIRD|FORD TRUCK|FORD,|1980 CAPRI CLASSIC|MUSTANG|TEMPO|THUNDERBIRD|TRANSIT|WHITE 1991 FORD TEMPO",
  "1989 HONDA ACCORD|CIVIC|HOJNDA|HONDA|HONDA ACCORD|HONDA (CIVIC?)|HONDA ACCORD|HONDA CIVIC|HONDA CIVIC 4 DR. 1|HONDA PRELUDE|HONDA SUV|HONDA?",
  "1992 volkswagen|2000 Jetta|jetta|passat|V.W.|volkawagen|volkawagon|VOLKS|volkswagan|volkswagen|volkswagon|Volkswagon Golf|volkswagon passat|Volkwagon|vw|VW camperized van|vw jetta|VW Passat",
  "1993 Buick Regal|buick|buick acheiva|buick alero|buick regal|buiick riveria|alero",
  "1994 GMC Extra can lon|G.M.C.|GMC|GMC - Sierra|GMC Jimmy|GMC Tracker|GMC 3500|Gmc 3500 Truck|gmc 4x4 stolen bc 77|GMC Discovery|GMC Jimmy|GMC SAFARI|gmc sierra truck|GMC van|gmc vandura|GMC Vanguard|GMC/Chevrolet|vanguard",
  "2005 audi a4 1.8 l|audi"
)
make_data <- c("Unknown", "Other", "Drummond", "Ford", "Honda", "Volkswagen", "Buick", "GMC", "Audi")
raw_reference <- tibble(grep_data, make_data)

# Split one pipe-delimited string into one row per raw value,
# paired with the make it should be recoded to
make_replacement_table <- function(namestring) {
  strsplit(namestring[1], split = "|", fixed = TRUE) %>%
    unlist() %>%
    tibble(., namestring[2]) %>%
    set_names(c("EquipMake", "New_Make"))
}

###############################
# The answer to your question

# Create the lookup table containing original and replacement values
# You could create the table in Excel and import with readr::read_csv()
# distinct() drops repeated rows so the join does not duplicate records
reference_table <- apply(raw_reference, 1, make_replacement_table) %>%
  do.call(rbind.data.frame, .) %>%
  distinct()

# Simulate a raw dataset by sampling EquipMake values from the lookup table
# (this has to come after reference_table is built)
mydata <- tibble(EquipMake = sample(reference_table$EquipMake, size = 1000, replace = TRUE))

# Now join reference_table against your raw data
# Any values of EquipMake you haven't coded will be NA
mydata <- mydata %>% left_join(reference_table, by = "EquipMake")
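
A lookup join only matches values that are spelled identically, including case and spacing, so it can help to normalise both sides before matching. The following is a minimal sketch of that idea in base R, using `match()` as a stand-in for `left_join`. The sample rows, the `normalise_key` helper, and the choice to upper-case and collapse whitespace are all assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical raw values, modelled on the example table in the question
mydata <- data.frame(
  EquipMake = c("ford", "1987 Ford", "Cadilac  Seville", "bmw", "Some Unlisted Make"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per raw value to recode
reference_table <- data.frame(
  EquipMake = c("FORD", "1987 FORD", "CADILAC SEVILLE", "BMW"),
  New_Make  = c("Ford", "Ford", "Cadillac", "BMW"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim the ends, collapse runs of whitespace, upper-case
normalise_key <- function(x) {
  toupper(gsub("[[:space:]]+", " ", trimws(x)))
}

# match() gives, for each raw value, the matching row of reference_table (or NA)
idx <- match(normalise_key(mydata$EquipMake),
             normalise_key(reference_table$EquipMake))
mydata$New_Make <- reference_table$New_Make[idx]

print(mydata)
```

Values that are not in the lookup table stay `NA`, so they are easy to list with `mydata[is.na(mydata$New_Make), ]` and add to the table afterwards.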
