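This question may look like a duplicate, but I am having trouble extracting country names from strings. I have already gone through this link, [link] Extracting Country Name from Author Affiliations, but it did not solve my problem. I have tried grepl inside a for loop for text matching and replacement, but my data column has more than 300k rows, so that approach is very slow.
I have a column like this.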
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df <- data.frame(org_loc = c("zug", "zug canton of zug", "zimbabwe",
  "zigong", "zhuhai", "zaragoza", "York United Kingdom", "Delhi",
  "Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
  "Washington D.C.", "Washington D.C. Metro", "New York"))
Each string may contain the name of a state, a city, or a country. I want only the country as output, like this:
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States
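I am trying to use the countrycode library to convert a state (when a match is found) to its country, but I cannot get it to work. Any help would be appreciated.
Answer 0: (score: 0)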
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It fails to match many of the entries, but according to the countrycode reference manual, this should do what you need.
Answer 1: (score: 0)
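Using the geocode function from the ggmap package, you can do a good but not perfectly accurate job of this. You also have to apply your own criteria to decide that "Zaragoza" is the city in Spain (which is what geocode returns) rather than a place in Argentina; when several places share a name, geocode tends to give you the largest city. (Remove $country to see the full output.)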
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
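Because the geocoding is provided by Google, it has a query limit of 2,500 per IP address per day. If it returns NA, that may be due to inconsistent limit checking, so just try again.
Answer 2: (score: 0)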
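Given the query limit and a column of more than 300k rows, one way to reduce the number of requests is to look up each distinct string only once and then map the results back to the rows. The following is a minimal base-R sketch, assuming the column contains many repeated values. `lookup_unique` is a hypothetical helper name, and `fake_lookup` is only a stand-in for a real call such as `function(s) geocode(s, output = "more")$country`.

```r
# Apply an expensive, vectorised lookup only to the distinct
# (case- and whitespace-normalised) values of x, then expand back.
lookup_unique <- function(x, lookup_fun) {
  key <- tolower(trimws(as.character(x)))  # "Zug" and "zug " share one query
  uniq <- unique(key)
  res <- lookup_fun(uniq)                  # one call over distinct values only
  res[match(key, uniq)]                    # restore the original row order
}

# Stand-in for a real geocoder; it only counts how many strings it was sent.
calls <- 0
fake_lookup <- function(s) {
  calls <<- calls + length(s)
  toupper(s)
}

out <- lookup_unique(c("Zug", "zug ", "Delhi", "Zug"), fake_lookup)
```

Here four rows result in only two lookups. The same wrapper could be placed around a countrycode call, since it only requires a function that takes and returns a vector.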
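You can use City_and_province_list.csv as a custom dictionary for countrycode. A custom dictionary cannot have duplicates in the origin vector (the City column of City_and_province_list.csv), so you have to remove them or otherwise deal with them first (as in the example below). At the moment your lookup CSV does not contain every string in your example, so not all of them are converted, but if you add all the possible strings to the CSV, it will work completely.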
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
#                  org_loc                  country
# 1                    Zug              Switzerland
# 2      Zug Canton of Zug                     <NA>
# 3               Zimbabwe                     <NA>
# 4                 Zigong                    China
# 5                 Zhuhai                    China
# 6               Zaragoza                    Spain
# 7    York United Kingdom                     <NA>
# 8                  Delhi                    India
# 9    Yalleroi Queensland                     <NA>
# 10      Waterloo Ontario                     <NA>
# 11           Waterloo ON                     <NA>
# 12       Washington D.C.                     <NA>
# 13 Washington D.C. Metro                     <NA>
# 14              New York United States of America
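The `<NA>` rows above are strings such as "Waterloo Ontario" that have no exact entry in the dictionary. One possible workaround, sketched here in base R, is to try the full string first and then progressively shorter trailing word sequences ("york united kingdom", then "united kingdom", then "kingdom"). The `dict` vector below is a small hypothetical lookup invented for illustration, not the contents of City_and_province_list.csv, and `country_from_suffix` is a hypothetical helper name.

```r
# Hypothetical lookup: lower-case place name -> country.
dict <- c("zug" = "Switzerland", "united kingdom" = "United Kingdom",
          "ontario" = "Canada", "queensland" = "Australia",
          "delhi" = "India")

country_from_suffix <- function(x, dict) {
  x <- as.character(x)
  one <- function(s) {
    words <- strsplit(tolower(trimws(s)), "\\s+")[[1]]
    # Try the whole string, then drop leading words one at a time.
    for (i in seq_along(words)) {
      cand <- paste(words[i:length(words)], collapse = " ")
      if (cand %in% names(dict)) return(unname(dict[cand]))
    }
    NA_character_                          # nothing matched
  }
  uniq <- unique(x)                        # work on distinct strings only
  res <- vapply(uniq, one, character(1), USE.NAMES = FALSE)
  res[match(x, uniq)]
}

res <- country_from_suffix(c("Zug", "York United Kingdom", "Waterloo Ontario",
                             "Yalleroi Queensland", "Zigong"), dict)
```

Suffix matching can misfire when a trailing word is itself a place name (for example, "New York" would fall back to "york" if only that key existed), so the results are worth spot-checking before relying on them.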