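I have several datasets that contain inconsistent country names, and I would like to do some kind of fuzzy merge on the country name.
For example, I have "Iran (I.R.)" and "Iran, Islamic Rep.", and I would like them to be treated as equal in a merge or join_all.
I can tolerate some errors in the matching; I just want an improvement without doing too much work.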
[1] "Afghanistan"
[2] "Albania"
[3] "Algeria"
[4] "American Samoa"
[5] "Andorra"
[6] "Angola"
[7] "Anguilla"
[8] "Antigua and Barbuda"
[9] "Antigua & Barbuda"
[10] "Arab World"
[11] "Argentina"
[12] "Armenia"
[13] "Aruba"
[14] "Ascension"
[15] "Australia"
[16] "Austria"
[17] "Azerbaijan"
[18] "Bahamas"
[19] "Bahamas, The"
[20] "Bahrain"
[21] "Bangladesh"
[22] "Barbados"
[23] "Belarus"
[24] "Belgium"
[25] "Belize"
[26] "Benin"
[27] "Bermuda"
[28] "Bhutan"
[29] "Bolivia"
[30] "Bosnia and Herzegovina"
[31] "Botswana"
[32] "Brazil"
[33] "British Virgin Islands"
[34] "Brunei Darussalam"
[35] "Bulgaria"
[36] "Burkina Faso"
[37] "Burundi"
[38] "Cabo Verde"
[39] "Cambodia"
[40] "Cameroon"
[41] "Canada"
[42] "Cape Verde"
[43] "Caribbean small states"
[44] "Cayman Islands"
[45] "Central African Rep."
[46] "Central African Republic"
[47] "Chad"
[48] "Channel Islands"
[49] "Chile"
[50] "China"
[51] "Cocos Keeling Islands"
[52] "Colombia"
[53] "Comoros"
[54] "Congo"
[55] "Congo (Dem. Rep.)"
[56] "Congo, Dem. Rep."
[57] "Congo, Rep."
[58] "Costa Rica"
[59] "Cote d'Ivoire"
[60] "Côte d'Ivoire"
[61] "Croatia"
[62] "Cuba"
[63] "Curacao"
[64] "Cyprus"
[65] "Czech Republic"
[66] "Denmark"
[67] "Djibouti"
[68] "Dominica"
[69] "Dominican Rep."
[70] "Dominican Republic"
[71] "D.P.R. Korea"
[72] "East Asia and the Pacific (IFC classification)"
[73] "East Asia & Pacific (all income levels)"
[74] "East Asia & Pacific (developing only)"
[75] "Ecuador"
[76] "Egypt"
[77] "Egypt, Arab Rep."
[78] "El Salvador"
[79] "Equatorial Guinea"
[80] "Eritrea"
[81] "Estonia"
[82] "Ethiopia"
[83] "Euro area"
[84] "Europe and Central Asia (IFC classification)"
[85] "European Union"
[86] "Europe & Central Asia (all income levels)"
[87] "Europe & Central Asia (developing only)"
[88] "Faeroe Islands"
[89] "Falkland (Malvinas) Is."
[90] "Faroe Islands"
[91] "Fiji"
[92] "Finland"
[93] "France"
[94] "French Polynesia"
[95] "Gabon"
[96] "Gambia"
[97] "Gambia, The"
[98] "Georgia"
[99] "Germany"
[100] "Ghana"
[101] "Gibraltar"
[102] "Greece"
[103] "Greenland"
[104] "Grenada"
[105] "Guam"
[106] "Guatemala"
[107] "Guernsey"
[108] "Guinea"
[109] "Guinea-Bissau"
[110] "Guyana"
[111] "Haiti"
[112] "Heavily indebted poor countries (HIPC)"
[113] "High income"
[114] "High income: nonOECD"
[115] "High income: OECD"
[116] "Honduras"
[117] "Hong Kong, China"
[118] "Hong Kong SAR, China"
[119] "Hungary"
[120] "Iceland"
[121] "India"
[122] "Indonesia"
[123] "Iran (I.R.)"
[124] "Iran, Islamic Rep."
[125] "Iraq"
[126] "Ireland"
[127] "Isle of Man"
[128] "Israel"
[129] "Italy"
[130] "Jamaica"
[131] "Japan"
[132] "Jersey"
[133] "Jordan"
[134] "Kazakhstan"
[135] "Kenya"
[136] "Kiribati"
[137] "Korea, Dem. Rep."
[138] "Korea (Rep.)"
[139] "Korea, Rep."
[140] "Kosovo"
[141] "Kuwait"
[142] "Kyrgyz Republic"
[143] "Kyrgyzstan"
[144] "Lao PDR"
[145] "Lao P.D.R."
[146] "Latin America and the Caribbean (IFC classification)"
[147] "Latin America & Caribbean (all income levels)"
[148] "Latin America & Caribbean (developing only)"
[149] "Latvia"
[150] "Least developed countries: UN classification"
[151] "Lebanon"
[152] "Lesotho"
[153] "Liberia"
[154] "Libya"
[155] "Liechtenstein"
[156] "Lithuania"
[157] "Lower middle income"
[158] "Low income"
[159] "Low & middle income"
[160] "Luxembourg"
[161] "Macao, China"
[162] "Macao SAR, China"
[163] "Macedonia, FYR"
[164] "Madagascar"
[165] "Malawi"
[166] "Malaysia"
[167] "Maldives"
[168] "Mali"
[169] "Malta"
[170] "Marshall Islands"
[171] "Mauritania"
[172] "Mauritius"
[173] "Mayotte"
[174] "Mexico"
[175] "Micronesia"
[176] "Micronesia, Fed. Sts."
[177] "Middle East and North Africa (IFC classification)"
[178] "Middle East & North Africa (all income levels)"
[179] "Middle East & North Africa (developing only)"
[180] "Middle income"
[181] "Moldova"
[182] "Monaco"
[183] "Mongolia"
[184] "Montenegro"
[185] "Montserrat"
[186] "Morocco"
[187] "Mozambique"
[188] "Myanmar"
[189] "Namibia"
[190] "Nauru"
[191] "Nepal"
[192] "Neth. Antilles"
[193] "Netherlands"
[194] "New Caledonia"
[195] "New Zealand"
[196] "Nicaragua"
[197] "Niger"
[198] "Nigeria"
[199] "Niue"
[200] "Norfolk Islands"
[201] "North America"
[202] "Northern Mariana Islands"
[203] "Northern Marianas"
[204] "Norway"
[205] "Not classified"
[206] "OECD members"
[207] "Oman"
[208] "Other small states"
[209] "Pacific island small states"
[210] "Pakistan"
[211] "Palau"
[212] "Palestinian Authority"
[213] "Panama"
[214] "Papua New Guinea"
[215] "Paraguay"
[216] "Peru"
[217] "Philippines"
[218] "Poland"
[219] "Portugal"
[220] "Puerto Rico"
[221] "Qatar"
[222] "Romania"
[223] "Russia"
[224] "Russian Federation"
[225] "Rwanda"
[226] "Samoa"
[227] "San Marino"
[228] "Sao Tome and Principe"
[229] "Saudi Arabia"
[230] "Senegal"
[231] "Serbia"
[232] "Seychelles"
[233] "Sierra Leone"
[234] "Singapore"
[235] "Sint Maarten (Dutch part)"
[236] "Slovak Republic"
[237] "Slovenia"
[238] "Small states"
[239] "Solomon Islands"
[240] "Somalia"
[241] "South Africa"
[242] "South Asia"
[243] "South Asia (IFC classification)"
[244] "South Sudan"
[245] "Spain"
[246] "Sri Lanka"
[247] "St. Helena"
[248] "St. Kitts and Nevis"
[249] "St. Lucia"
[250] "St. Martin (French part)"
[251] "S. Tomé & Principe"
[252] "St. Pierre & Miquelon"
[253] "St. Vincent and the Grenadines"
[254] "Sub-Saharan Africa (all income levels)"
[255] "Sub-Saharan Africa (developing only)"
[256] "Sub-Saharan Africa (IFC classification)"
[257] "Sudan"
[258] "Suriname"
[259] "Swaziland"
[260] "Sweden"
[261] "Switzerland"
[262] "Syria"
[263] "Syrian Arab Republic"
[264] "Taiwan, Province of China"
[265] "Tajikistan"
[266] "Tanzania"
[267] "TFYR Macedonia"
[268] "Thailand"
[269] "Timor-Leste"
[270] "Togo"
[271] "Tokelau"
[272] "Tonga"
[273] "Trinidad and Tobago"
[274] "Trinidad & Tobago"
[275] "Tunisia"
[276] "Turkey"
[277] "Turkmenistan"
[278] "Turks and Caicos Islands"
[279] "Turks & Caicos Is."
[280] "Tuvalu"
[281] "Uganda"
[282] "Ukraine"
[283] "United Arab Emirates"
[284] "United Kingdom"
[285] "United States"
[286] "Upper middle income"
[287] "Uruguay"
[288] "Uzbekistan"
[289] "Vanuatu"
[290] "Vatican"
[291] "Venezuela"
[292] "Venezuela, RB"
[293] "Viet Nam"
[294] "Vietnam"
[295] "Virgin Islands (U.S.)"
[296] "Virgin Islands (US)"
[297] "Wallis and Futuna"
[298] "West Bank and Gaza"
[299] "World"
[300] "Yemen"
[301] "Yemen, Rep."
[302] "Zambia"
[303] "Zimbabwe"
Edit: the datasets come from two sources. The names are consistent within each source, but not between them.
Answer 0 (score: 2)
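I should preface this by saying that this is not a fuzzy-matching solution. It is a "do the work once and never think about it again" solution.
In general, and especially if I have to do this kind of operation often, I use the following steps. The same process also works very well for company names within a specific industry (I use it for Canadian/US/European financial product manufacturers).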
Let m be your vector of country names.
m <- as.character(m)                  # convert to character
m <- gsub(".", " ", m, fixed = TRUE)  # replace "." (fixed = TRUE: a literal dot, not the regex "any character")
m <- gsub(",", " ", m, fixed = TRUE)  # replace commas (and so on)
m <- tolower(m)                       # might fail if you have lots of special characters
m <- gsub("^\\s+|\\s+$", "", m)       # strip leading and trailing whitespace
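To see what the cleaning steps produce, here is a minimal sketch that wraps them in a helper and runs it on three names from the question. The helper name normalize_names is hypothetical (it is not part of the original answer), and the extra step that collapses repeated spaces is an assumption added for tidiness.

```r
# Hypothetical helper (not from the original answer): wraps the cleaning steps.
normalize_names <- function(x) {
  x <- as.character(x)
  x <- gsub(".", " ", x, fixed = TRUE)  # literal dot, not the regex "any character"
  x <- gsub(",", " ", x, fixed = TRUE)  # literal comma
  x <- tolower(x)
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace (added assumption)
  gsub("^\\s+|\\s+$", "", x)            # trim leading and trailing whitespace
}

cleaned <- normalize_names(c("Iran (I.R.)", "Iran, Islamic Rep.", "Bahamas, The"))
print(cleaned)
# "iran (i r )"  "iran islamic rep"  "bahamas the"
```

Both Iran variants now start with "iran", which is what makes the simple grep patterns in the next step sufficient.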
Working through the alphabet, start like this:
m[grep("afghanist", m)] <- "Afghanistan"   # grep() needs the vector to search as its second argument
m[grep("alban", m)] <- "Albania"
...
m[grep("iran", m)] <- "Islamic Republic of Iran"
...
m[grep("usa", m)] <- "United States of America"
m[grep("yemen", m)] <- "Yemen"
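If the list of patterns grows, the same idea can be kept in a lookup table rather than one line per country. The following is only a sketch under assumptions: the names patterns and recode_names are hypothetical, and only a handful of entries are shown. It matches every pattern against the original lower-cased input, so a later pattern cannot accidentally hit a value that has already been recoded.

```r
# Hypothetical lookup table: lower-case regex pattern -> canonical name.
# Only a few illustrative entries are listed.
patterns <- c(
  "afghanist" = "Afghanistan",
  "alban"     = "Albania",
  "iran"      = "Islamic Republic of Iran",
  "yemen"     = "Yemen"
)

# Hypothetical helper: apply each pattern to the untouched lower-case input.
recode_names <- function(x, patterns) {
  raw <- tolower(as.character(x))
  out <- raw
  for (p in names(patterns)) {
    out[grepl(p, raw)] <- patterns[[p]]
  }
  out
}

m <- c("Iran (I.R.)", "Iran, Islamic Rep.", "Yemen, Rep.", "Albania", "World")
recoded <- recode_names(m, patterns)
print(recoded)
# "Islamic Republic of Iran" "Islamic Republic of Iran" "Yemen" "Albania" "world"
```

Anything that no pattern matched stays in lower case ("world" here), which is the property the "unmatched" check further down relies on.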
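In most cases you won't need the whole country name in the pattern, because it is a small list. Finally, save the result to a list, with the unmatched names in a vector of their own for further review.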
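verbatims <- m
# Unmatched = anything that does not start with a capital letter
unmatched_idx <- which(!substr(m, 1, 1) %in% LETTERS)
unmatched <- m[unmatched_idx]        # keep the raw strings for further review
verbatims[unmatched_idx] <- "Other"  # or however you need to recode it (index by position, not by value)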
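Once both sources carry the same canonical name, the merge from the question becomes an ordinary exact join. A minimal sketch, assuming two toy data frames with made-up values and a hypothetical helper to_key that reduces the grep-based recoding to two patterns:

```r
# Toy data frames standing in for the two sources (values are made up).
a <- data.frame(country = c("Iran (I.R.)", "Yemen"), x = c(1, 2),
                stringsAsFactors = FALSE)
b <- data.frame(country = c("Iran, Islamic Rep.", "Yemen, Rep."), y = c(10, 20),
                stringsAsFactors = FALSE)

# Hypothetical recoder: same grep-based idea, cut down to two patterns.
to_key <- function(x) {
  lower <- tolower(x)
  key <- rep("Other", length(x))
  key[grepl("iran", lower)]  <- "Islamic Republic of Iran"
  key[grepl("yemen", lower)] <- "Yemen"
  key
}

a$key <- to_key(a$country)
b$key <- to_key(b$country)

# Exact merge on the canonical key; the original spellings are kept as
# country.a and country.b for auditing.
merged <- merge(a, b, by = "key", suffixes = c(".a", ".b"))
print(merged)
```

Rows recoded as "Other" in both sources would all match each other on the key, so it is probably safer to drop or review them before merging.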
By process of elimination, start updating the code for all the "unmatched" names.
Protip: if you use =concatenate()