Question

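I have a dataset with country names and exports, but the country names are in Spanish. I want to use countrycode to get the correct codes for mapping, but countrycode only converts from English/German into other languages (not the other way around). I tried using translateR to change the Spanish names to English, but the new column it outputs is exactly the same as the original names.

I don't think the problem is my API key, because at first I got a different error, which went away after I restarted R. Is something wrong in the code?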

#read in file
data <- read.csv("...", header = TRUE)
data$char <- as.character(data$pais_descripcion)

#translate
library(translateR)

google.dataset.out <- translate(dataset = data,
                                content.field = 'char',
                                google.api.key = 'key',
                                source.lang = 'es',
                                target.lang = 'en')

Dataset with countries(pais_descripcion)

Answer 1

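countrycode does have Spanish-language country names built in, but by default they cannot be used as an origin code. You can work around this by creating and using a custom dictionary, as in the example below. The drawback is that a custom dictionary is not matched with regular expressions, so the names have to match exactly (including case). If you are able and willing to create a set of regular expressions for Spanish country names, we would be very happy and grateful to integrate them into countrycode as an origin code that is available by default.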

library(countrycode)

custom_dict <- data.frame(spanish = countrycode::codelist$cldr.name.es,
                          english = countrycode::codelist$cldr.name.en,
                          stringsAsFactors = FALSE)

countries <- c("España", "Alemania")

countrycode(countries, "spanish", "english", custom_dict = custom_dict)
# [1] "Spain"   "Germany"

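Applied to the data you are using, this matches about 92% of the country-name entries (45060 of 48962 rows), which is at least a good start. You can add the unmatched names to the custom dictionary to cover all of them.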

library(countrycode)

url <- "https://catalogo.datos.gba.gob.ar/dataset/46b85203-17fe-42bd-b13f-1d3e150c06cd/resource/3eb20f55-7dc0-4671-a039-cf2e4b71c3db/download/expo_2016_2017.xlsx-expo.csv"
data <- read.csv(url, stringsAsFactors = FALSE)

custom_dict <- data.frame(spanish = countrycode::codelist$cldr.name.es,
                          english = countrycode::codelist$cldr.name.en,
                          stringsAsFactors = FALSE)

results <- countrycode(data$pais_descripcion, "spanish", "english", custom_dict = custom_dict)

sum(is.na(results))
# [1] 3902

sum(!is.na(results))
# [1] 45060
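
To close the remaining gap, one possible approach is to list the distinct names that came back as NA and append rows for them to the custom dictionary. The following is only a sketch: the small `countries` vector and the long-form Spanish name in `extra` are hypothetical stand-ins for whatever actually turns up as unmatched in your data, and it assumes that name is not already in the dictionary.

```r
library(countrycode)

# Same custom dictionary as above: Spanish CLDR names mapped to English ones
custom_dict <- data.frame(spanish = countrycode::codelist$cldr.name.es,
                          english = countrycode::codelist$cldr.name.en,
                          stringsAsFactors = FALSE)

# Hypothetical input: the last entry is a variant spelling assumed to be
# absent from the CLDR names, so it should not match at first
countries <- c("España", "Alemania",
               "Reino Unido de Gran Bretaña e Irlanda del Norte")

# warn = FALSE only silences the "not matched" warning; NAs are still returned
results <- countrycode(countries, "spanish", "english",
                       custom_dict = custom_dict, warn = FALSE)

# Distinct names that still need a dictionary entry
unmatched <- unique(countries[is.na(results)])
print(unmatched)

# Hand-written extra rows (hypothetical mapping), appended to the dictionary
extra <- data.frame(spanish = "Reino Unido de Gran Bretaña e Irlanda del Norte",
                    english = "United Kingdom",
                    stringsAsFactors = FALSE)
custom_dict2 <- rbind(custom_dict, extra)

# Re-run the conversion with the extended dictionary
results2 <- countrycode(countries, "spanish", "english",
                        custom_dict = custom_dict2, warn = FALSE)
print(sum(is.na(results2)))
```

On the real data, replace `countries` with `data$pais_descripcion` and build `extra` from whatever `unmatched` contains. Because matching is exact, differences in case, accents, or surrounding whitespace each need their own row (or a cleanup step beforehand).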
