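How do I merge on fuzzy field names?

Time: 2014-04-21 23:20:09

Tags: r

I have several datasets that contain inconsistent country names. I would like to do some kind of fuzzy merge on the country names.

For example, I have "Iran (I.R.)" and "Iran, Islamic Rep.", and I would like them to be treated as equivalent in merge or join_all.

I can tolerate errors in the matching; I just want an improvement without doing too much work.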

  [1] "Afghanistan"                                         
  [2] "Albania"                                             
  [3] "Algeria"                                             
  [4] "American Samoa"                                      
  [5] "Andorra"                                             
  [6] "Angola"                                              
  [7] "Anguilla"                                            
  [8] "Antigua and Barbuda"                                 
  [9] "Antigua & Barbuda"                                   
 [10] "Arab World"                                          
 [11] "Argentina"                                           
 [12] "Armenia"                                             
 [13] "Aruba"                                               
 [14] "Ascension"                                           
 [15] "Australia"                                           
 [16] "Austria"                                             
 [17] "Azerbaijan"                                          
 [18] "Bahamas"                                             
 [19] "Bahamas, The"                                        
 [20] "Bahrain"                                             
 [21] "Bangladesh"                                          
 [22] "Barbados"                                            
 [23] "Belarus"                                             
 [24] "Belgium"                                             
 [25] "Belize"                                              
 [26] "Benin"                                               
 [27] "Bermuda"                                             
 [28] "Bhutan"                                              
 [29] "Bolivia"                                             
 [30] "Bosnia and Herzegovina"                              
 [31] "Botswana"                                            
 [32] "Brazil"                                              
 [33] "British Virgin Islands"                              
 [34] "Brunei Darussalam"                                   
 [35] "Bulgaria"                                            
 [36] "Burkina Faso"                                        
 [37] "Burundi"                                             
 [38] "Cabo Verde"                                          
 [39] "Cambodia"                                            
 [40] "Cameroon"                                            
 [41] "Canada"                                              
 [42] "Cape Verde"                                          
 [43] "Caribbean small states"                              
 [44] "Cayman Islands"                                      
 [45] "Central African Rep."                                
 [46] "Central African Republic"                            
 [47] "Chad"                                                
 [48] "Channel Islands"                                     
 [49] "Chile"                                               
 [50] "China"                                               
 [51] "Cocos Keeling Islands"                               
 [52] "Colombia"                                            
 [53] "Comoros"                                             
 [54] "Congo"                                               
 [55] "Congo (Dem. Rep.)"                                   
 [56] "Congo, Dem. Rep."                                    
 [57] "Congo, Rep."                                         
 [58] "Costa Rica"                                          
 [59] "Cote d'Ivoire"                                       
 [60] "Côte d'Ivoire"                                       
 [61] "Croatia"                                             
 [62] "Cuba"                                                
 [63] "Curacao"                                             
 [64] "Cyprus"                                              
 [65] "Czech Republic"                                      
 [66] "Denmark"                                             
 [67] "Djibouti"                                            
 [68] "Dominica"                                            
 [69] "Dominican Rep."                                      
 [70] "Dominican Republic"                                  
 [71] "D.P.R. Korea"                                        
 [72] "East Asia and the Pacific (IFC classification)"      
 [73] "East Asia & Pacific (all income levels)"             
 [74] "East Asia & Pacific (developing only)"               
 [75] "Ecuador"                                             
 [76] "Egypt"                                               
 [77] "Egypt, Arab Rep."                                    
 [78] "El Salvador"                                         
 [79] "Equatorial Guinea"                                   
 [80] "Eritrea"                                             
 [81] "Estonia"                                             
 [82] "Ethiopia"                                            
 [83] "Euro area"                                           
 [84] "Europe and Central Asia (IFC classification)"        
 [85] "European Union"                                      
 [86] "Europe & Central Asia (all income levels)"           
 [87] "Europe & Central Asia (developing only)"             
 [88] "Faeroe Islands"                                      
 [89] "Falkland (Malvinas) Is."                             
 [90] "Faroe Islands"                                       
 [91] "Fiji"                                                
 [92] "Finland"                                             
 [93] "France"                                              
 [94] "French Polynesia"                                    
 [95] "Gabon"                                               
 [96] "Gambia"                                              
 [97] "Gambia, The"                                         
 [98] "Georgia"                                             
 [99] "Germany"                                             
[100] "Ghana"                                               
[101] "Gibraltar"                                           
[102] "Greece"                                              
[103] "Greenland"                                           
[104] "Grenada"                                             
[105] "Guam"                                                
[106] "Guatemala"                                           
[107] "Guernsey"                                            
[108] "Guinea"                                              
[109] "Guinea-Bissau"                                       
[110] "Guyana"                                              
[111] "Haiti"                                               
[112] "Heavily indebted poor countries (HIPC)"              
[113] "High income"                                         
[114] "High income: nonOECD"                                
[115] "High income: OECD"                                   
[116] "Honduras"                                            
[117] "Hong Kong, China"                                    
[118] "Hong Kong SAR, China"                                
[119] "Hungary"                                             
[120] "Iceland"                                             
[121] "India"                                               
[122] "Indonesia"                                           
[123] "Iran (I.R.)"                                         
[124] "Iran, Islamic Rep."                                  
[125] "Iraq"                                                
[126] "Ireland"                                             
[127] "Isle of Man"                                         
[128] "Israel"                                              
[129] "Italy"                                               
[130] "Jamaica"                                             
[131] "Japan"                                               
[132] "Jersey"                                              
[133] "Jordan"                                              
[134] "Kazakhstan"                                          
[135] "Kenya"                                               
[136] "Kiribati"                                            
[137] "Korea, Dem. Rep."                                    
[138] "Korea (Rep.)"                                        
[139] "Korea, Rep."                                         
[140] "Kosovo"                                              
[141] "Kuwait"                                              
[142] "Kyrgyz Republic"                                     
[143] "Kyrgyzstan"                                          
[144] "Lao PDR"                                             
[145] "Lao P.D.R."                                          
[146] "Latin America and the Caribbean (IFC classification)"
[147] "Latin America & Caribbean (all income levels)"       
[148] "Latin America & Caribbean (developing only)"         
[149] "Latvia"                                              
[150] "Least developed countries: UN classification"        
[151] "Lebanon"                                             
[152] "Lesotho"                                             
[153] "Liberia"                                             
[154] "Libya"                                               
[155] "Liechtenstein"                                       
[156] "Lithuania"                                           
[157] "Lower middle income"                                 
[158] "Low income"                                          
[159] "Low & middle income"                                 
[160] "Luxembourg"                                          
[161] "Macao, China"                                        
[162] "Macao SAR, China"                                    
[163] "Macedonia, FYR"                                      
[164] "Madagascar"                                          
[165] "Malawi"                                              
[166] "Malaysia"                                            
[167] "Maldives"                                            
[168] "Mali"                                                
[169] "Malta"                                               
[170] "Marshall Islands"                                    
[171] "Mauritania"                                          
[172] "Mauritius"                                           
[173] "Mayotte"                                             
[174] "Mexico"                                              
[175] "Micronesia"                                          
[176] "Micronesia, Fed. Sts."                               
[177] "Middle East and North Africa (IFC classification)"   
[178] "Middle East & North Africa (all income levels)"      
[179] "Middle East & North Africa (developing only)"        
[180] "Middle income"                                       
[181] "Moldova"                                             
[182] "Monaco"                                              
[183] "Mongolia"                                            
[184] "Montenegro"                                          
[185] "Montserrat"                                          
[186] "Morocco"                                             
[187] "Mozambique"                                          
[188] "Myanmar"                                             
[189] "Namibia"                                             
[190] "Nauru"                                               
[191] "Nepal"                                               
[192] "Neth. Antilles"                                      
[193] "Netherlands"                                         
[194] "New Caledonia"                                       
[195] "New Zealand"                                         
[196] "Nicaragua"                                           
[197] "Niger"                                               
[198] "Nigeria"                                             
[199] "Niue"                                                
[200] "Norfolk Islands"                                     
[201] "North America"                                       
[202] "Northern Mariana Islands"                            
[203] "Northern Marianas"                                   
[204] "Norway"                                              
[205] "Not classified"                                      
[206] "OECD members"                                        
[207] "Oman"                                                
[208] "Other small states"                                  
[209] "Pacific island small states"                         
[210] "Pakistan"                                            
[211] "Palau"                                               
[212] "Palestinian Authority"                               
[213] "Panama"                                              
[214] "Papua New Guinea"                                    
[215] "Paraguay"                                            
[216] "Peru"                                                
[217] "Philippines"                                         
[218] "Poland"                                              
[219] "Portugal"                                            
[220] "Puerto Rico"                                         
[221] "Qatar"                                               
[222] "Romania"                                             
[223] "Russia"                                              
[224] "Russian Federation"                                  
[225] "Rwanda"                                              
[226] "Samoa"                                               
[227] "San Marino"                                          
[228] "Sao Tome and Principe"                               
[229] "Saudi Arabia"                                        
[230] "Senegal"                                             
[231] "Serbia"                                              
[232] "Seychelles"                                          
[233] "Sierra Leone"                                        
[234] "Singapore"                                           
[235] "Sint Maarten (Dutch part)"                           
[236] "Slovak Republic"                                     
[237] "Slovenia"                                            
[238] "Small states"                                        
[239] "Solomon Islands"                                     
[240] "Somalia"                                             
[241] "South Africa"                                        
[242] "South Asia"                                          
[243] "South Asia (IFC classification)"                     
[244] "South Sudan"                                         
[245] "Spain"                                               
[246] "Sri Lanka"                                           
[247] "St. Helena"                                          
[248] "St. Kitts and Nevis"                                 
[249] "St. Lucia"                                           
[250] "St. Martin (French part)"                            
[251] "S. Tomé & Principe"                                  
[252] "St. Pierre & Miquelon"                               
[253] "St. Vincent and the Grenadines"                      
[254] "Sub-Saharan Africa (all income levels)"              
[255] "Sub-Saharan Africa (developing only)"                
[256] "Sub-Saharan Africa (IFC classification)"             
[257] "Sudan"                                               
[258] "Suriname"                                            
[259] "Swaziland"                                           
[260] "Sweden"                                              
[261] "Switzerland"                                         
[262] "Syria"                                               
[263] "Syrian Arab Republic"                                
[264] "Taiwan, Province of China"                           
[265] "Tajikistan"                                          
[266] "Tanzania"                                            
[267] "TFYR Macedonia"                                      
[268] "Thailand"                                            
[269] "Timor-Leste"                                         
[270] "Togo"                                                
[271] "Tokelau"                                             
[272] "Tonga"                                               
[273] "Trinidad and Tobago"                                 
[274] "Trinidad & Tobago"                                   
[275] "Tunisia"                                             
[276] "Turkey"                                              
[277] "Turkmenistan"                                        
[278] "Turks and Caicos Islands"                            
[279] "Turks & Caicos Is."                                  
[280] "Tuvalu"                                              
[281] "Uganda"                                              
[282] "Ukraine"                                             
[283] "United Arab Emirates"                                
[284] "United Kingdom"                                      
[285] "United States"                                       
[286] "Upper middle income"                                 
[287] "Uruguay"                                             
[288] "Uzbekistan"                                          
[289] "Vanuatu"                                             
[290] "Vatican"                                             
[291] "Venezuela"                                           
[292] "Venezuela, RB"                                       
[293] "Viet Nam"                                            
[294] "Vietnam"                                             
[295] "Virgin Islands (U.S.)"                               
[296] "Virgin Islands (US)"                                 
[297] "Wallis and Futuna"                                   
[298] "West Bank and Gaza"                                  
[299] "World"                                               
[300] "Yemen"                                               
[301] "Yemen, Rep."                                         
[302] "Zambia"                                              
[303] "Zimbabwe"                                            

Edit: The datasets come from two sources. The names are consistent within each source, but not between them.

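1 Answer:

Answer 0: (score: 2)

I should preface this by saying that this is not a fuzzy matching solution. It is a "do the work once and never think about it again" solution.

In general, and especially if I have to do this kind of thing often, I use the following steps. The process also works well for company names within a specific industry (I use it for Canadian/US/European financial product manufacturers).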

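  1. Normalize the strings (lowercase, strip whitespace, strip special characters).
  2. Replace in alphabetical order.
  3. Adjust for the unmatched names.

Let m be your vector of country names.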

    m <- as.character(m)                  # convert to character
    m <- gsub(".", " ", m, fixed = TRUE)  # replace "." with a space (unfixed, "." is a regex wildcard)
    m <- gsub(",", " ", m, fixed = TRUE)  # replace commas with a space (and so on)
    m <- tolower(m)                       # might fail if you have lots of special characters
    m <- gsub("^\\s+|\\s+$", "", m)       # strip leading and trailing whitespace
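Since the names come from two sources, it may help to wrap the normalization in a function so both get exactly the same cleaning. A minimal sketch in base R, assuming R 3.2.0 or later for trimws(); the name normalize_names is hypothetical and not part of the original answer, and it strips all punctuation rather than only "." and ",":

```r
# Hypothetical helper (not from the original answer): applies the same
# normalization to any vector of country names.
normalize_names <- function(x) {
  x <- as.character(x)
  x <- gsub("[[:punct:]]", " ", x)  # every punctuation character becomes a space
  x <- tolower(x)
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace to one space
  trimws(x)                         # strip leading and trailing whitespace
}

sample_names <- c("Iran (I.R.)", "Iran, Islamic Rep.", "  Bahamas, The ")
normalized <- normalize_names(sample_names)
print(normalized)
```

The two Iran spellings are still different strings after this step, which is why the pattern replacement that follows is needed.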
    

Working alphabetically, start like this:

    m[grep("afghanist", m)] <- "Afghanistan"
    m[grep("alban", m)] <- "Albania"
    ...
    m[grep("iran", m)] <- "Islamic Republic of Iran"
    ...
    m[grep("usa", m)] <- "United States of America"
    m[grep("yemen", m)] <- "Yemen"
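The one-line-per-country replacements above can also be driven from a lookup table, which keeps the patterns in one place. This is only a sketch of that variation, not the original author's code: the patterns vector and the sample values in m are illustrative assumptions.

```r
# Hypothetical lookup table: element names are the patterns to search for,
# values are the canonical labels. Only a few illustrative entries are shown.
patterns <- c(
  "afghanist" = "Afghanistan",
  "alban"     = "Albania",
  "iran"      = "Islamic Republic of Iran",
  "yemen"     = "Yemen"
)

# Sample of already-normalized names (assumed input).
m <- c("afghanistan", "albania", "iran i r", "iran islamic rep",
       "yemen rep", "arab world")

# Apply each pattern in turn; fixed = TRUE treats the pattern as a literal string.
for (p in names(patterns)) {
  m[grepl(p, m, fixed = TRUE)] <- patterns[[p]]
}
print(m)
```

Short patterns need care: "oman" is also a substring of "romania", and "niger" of "nigeria", so either order the replacements deliberately or drop fixed = TRUE and anchor the pattern (for example "^oman").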
    

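In most cases you won't need the whole country name as the pattern, because the list is small. Finally, save the result, and put the unmatched names in their own vector for further review.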

    verbatims <- m

    # Unmatched = anything that does not start with a capital letter
    unmatched_idx <- which(!substr(m, 1, 1) %in% LETTERS)

    unmatched <- m[unmatched_idx]        # keep these for further review
    verbatims[unmatched_idx] <- "Other"  # Or however you need to recode it
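Once both sources are recoded to the same canonical labels, an ordinary merge lines them up. The sketch below is an illustration under assumptions: the data frames src_a and src_b, their columns, and the helper recode_country are all hypothetical, and only two countries are handled.

```r
# Two hypothetical sources that spell the same countries differently.
src_a <- data.frame(country = c("Iran (I.R.)", "Yemen"),
                    x = c(1, 2), stringsAsFactors = FALSE)
src_b <- data.frame(country = c("Iran, Islamic Rep.", "Yemen, Rep."),
                    y = c(10, 20), stringsAsFactors = FALSE)

# Hypothetical recoder: lowercases, then replaces by pattern as in the answer.
recode_country <- function(x) {
  x <- tolower(as.character(x))
  x[grepl("iran", x, fixed = TRUE)]  <- "Islamic Republic of Iran"
  x[grepl("yemen", x, fixed = TRUE)] <- "Yemen"
  x
}

src_a$country <- recode_country(src_a$country)
src_b$country <- recode_country(src_b$country)

# Inner join on the canonical name; rows without a partner are dropped.
merged <- merge(src_a, src_b, by = "country")
print(merged)
```

Passing all = TRUE to merge would keep the rows that found no partner, which is one way to spot names that still need a pattern.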
    

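By process of elimination, start updating the code for everything that is still "unmatched".

Protip: you can use Excel to build the code for you with =concatenate(), and then fine-tune it.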
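The same code-generation trick can be done without leaving R, using sprintf to build the replacement lines from two vectors. This is an alternative to the Excel approach, not something the original answer describes; the vectors pats and labels are made-up examples.

```r
# Hypothetical pattern/label pairs; in practice these might be typed by hand
# or read from a file.
pats   <- c("afghanist", "alban")
labels <- c("Afghanistan", "Albania")

# Build one replacement statement per pair; sprintf is vectorized over both.
code_lines <- sprintf('m[grep("%s", m)] <- "%s"', pats, labels)
writeLines(code_lines)
```

The printed lines can be pasted into a script and adjusted by hand where a pattern is too loose.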