Extracting country names from strings in R

Time: 2018-03-22 18:19:15

Tags: r mapping state country textmatching

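This question may look like a duplicate, but I am having trouble extracting country names from strings. I have already gone through this link, [link] Extracting Country Name from Author Affiliations, but it did not solve my problem. I have tried text matching and replacement with grepl inside a for loop, but my data column has more than 300k rows, so pattern matching with grepl in a for loop is very slow.

I have a column like this: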

org_loc

Zug
Zug  Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza 
York  United Kingdom
Delhi
Yalleroi  Queensland
Waterloo  Ontario
Waterloo  ON 
Washington  D.C.
Washington D.C. Metro 
New York


df <- data.frame(
  org_loc = c("zug", "zug  canton of zug", "zimbabwe", "zigong", "zhuhai",
              "zaragoza", "York  United Kingdom", "Delhi",
              "Yalleroi  Queensland", "Waterloo  Ontario", "Waterloo  ON",
              "Washington  D.C.", "Washington D.C. Metro", "New York"),
  stringsAsFactors = FALSE
)

Each string may contain the name of a state, a city, or a country. I only want the country as output, like this:

org_loc

Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States

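I am trying to use the countrycode library to convert a state (when a match is found) to its country, but I cannot get it to work. Any help would be appreciated.

3 answers:

Answer 0: (score: 0)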

library(countrycode)
df <- c("zug  switzerland", "zug  canton of zug  switzerland", "zimbabwe",
        "zigong  chengdu  pr china", "zhuhai  guangdong  china", "zaragoza",
        "York  United Kingdom", "Yamunanagar",
        "Yalleroi  Queensland  Australia", "Waterloo  Ontario", "Waterloo  ON",
        "Washington  D.C.", "Washington D.C. Metro", "USA")
df1 <- countrycode(df, 'country.name', 'country.name')

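It fails to match many of the entries, but according to the countrycode reference manual, this should do what you need.

Answer 1: (score: 0)

With the geocode function from the ggmap package you can do the job well, though not with complete accuracy. You also have to apply your own judgment to decide that "Zaragoza" is the city in Spain (which is what geocode returns) rather than somewhere in Argentina; when several places share a name, geocode tends to give you the largest city. (Remove $country to see the full output.)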
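To see which entries fail to match, one option is to suppress the warning and inspect the NA results directly. This is a minimal sketch using a shortened version of the vector above; the names `locs`, `country`, and `unmatched` are illustrative only.

```r
library(countrycode)

# A shortened sample of the strings from this answer.
locs <- c("zug  switzerland", "zimbabwe", "Yamunanagar",
          "Waterloo  Ontario", "USA")

# The 'country.name' origin matches country names by regular expression,
# so strings without a recognisable country name come back as NA.
country <- countrycode(locs, "country.name", "country.name", warn = FALSE)

# Collect the inputs that produced no match for manual review.
unmatched <- locs[is.na(country)]
print(data.frame(locs, country))
print(unmatched)
```

Entries that contain only a city or a state, such as "Yamunanagar" or "Waterloo  Ontario", end up in `unmatched` and need another source of information.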

library(ggmap)
org_loc <- c("zug", "zug  canton of zug", "zimbabwe",
             "zigong", "zhuhai", "zaragoza", "York  United Kingdom",
             "Delhi", "Yalleroi  Queensland", "Waterloo  Ontario",
             "Waterloo  ON", "Washington  D.C.", "Washington D.C. Metro",
             "New York")
geocode(org_loc, output = "more")$country

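Because the geocoding is provided by Google, it has a query limit of 2,500 per IP address per day; if it returns NA, that may be due to inconsistent limit checking, so just try again.

Answer 2: (score: 0)

You can use City_and_province_list.csv as a custom dictionary for countrycode. A custom dictionary cannot have duplicates in the origin vector (the City column of City_and_province_list.csv), so you must first remove them or handle them in some way (as in the example below). Currently your lookup CSV does not contain every string in your example, so not all of them are converted, but if you add all possible strings to the CSV, it will work completely.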
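With a daily query limit and a column of more than 300k rows, it may help to look up each distinct string only once and map the results back. The sketch below shows that idea in base R; `lookup_country` is a hypothetical stand-in for the expensive call (for example geocode) and is not part of ggmap.

```r
# Hypothetical stand-in for an expensive lookup such as a geocoding request.
lookup_country <- function(x) {
  known <- c("delhi" = "India", "zhuhai" = "China")
  unname(known[tolower(x)])
}

org_loc <- c("Delhi", "Zhuhai", "Delhi", "delhi ", "Zhuhai")

# Normalise case and whitespace so trivially different strings share a key.
key <- tolower(trimws(org_loc))

# Look up each distinct key only once.
uniq <- unique(key)
uniq_country <- lookup_country(uniq)

# Map the results back onto every row of the original column.
country <- uniq_country[match(key, uniq)]
print(country)
```

Here five rows need only two lookups; on real data the saving depends on how often the same location string repeats.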

library(countrycode)

org_loc <- c("Zug", "Zug  Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
             "Zaragoza", "York  United Kingdom", "Delhi",
             "Yalleroi  Queensland", "Waterloo  Ontario", "Waterloo  ON",
             "Washington  D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)

city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")

# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]

df$country <- countrycode(df$org_loc, "City", "Country", 
                          custom_dict = city_country)

df
# org_loc                  country
# 1                    Zug              Switzerland
# 2     Zug  Canton of Zug                     <NA>
# 3               Zimbabwe                     <NA>
# 4                 Zigong                    China
# 5                 Zhuhai                    China
# 6               Zaragoza                    Spain
# 7   York  United Kingdom                     <NA>
# 8                  Delhi                    India
# 9   Yalleroi  Queensland                     <NA>
# 10     Waterloo  Ontario                     <NA>
# 11          Waterloo  ON                     <NA>
# 12      Washington  D.C.                     <NA>
# 13 Washington D.C. Metro                     <NA>
# 14              New York United States of America
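Several of the NA rows above are multi-part strings whose parts are separated by two spaces. A possible complement, sketched here in base R, is to split each string and look up the parts individually. The small `lookup` table is hypothetical and stands in for the CSV; matching is exact and case-sensitive.

```r
# Hypothetical lookup table; in practice this would come from the CSV.
lookup <- data.frame(
  City = c("Zug", "Ontario", "Queensland", "United Kingdom"),
  Country = c("Switzerland", "Canada", "Australia", "United Kingdom"),
  stringsAsFactors = FALSE
)

org_loc <- c("Zug  Canton of Zug", "York  United Kingdom",
             "Yalleroi  Queensland", "Waterloo  Ontario", "Zigong")

# Split each entry on runs of two or more spaces.
parts <- strsplit(trimws(org_loc), " {2,}")

# Return the country of the first part found in the lookup table, else NA.
first_hit <- function(p) {
  idx <- match(p, lookup$City)
  idx <- idx[!is.na(idx)]
  if (length(idx) > 0) lookup$Country[idx[1]] else NA_character_
}

country <- vapply(parts, first_hit, character(1))
print(data.frame(org_loc, country))
```

Taking the first matching part is a simplification; if a city and a state in the same string point to different countries, a rule for choosing between them would be needed.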