Question

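When I read a particular Spanish-language website, I get the Spanish accented characters as HTML entities. I read the site with the readLines function (I have to use this function).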

url <- "http://www.senamhi.gob.pe/include_mapas/_map_data_hist03.php?drEsta=01"
char_data <- readLines(url,encoding="UTF-8")

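After all the processing to extract my data, I end up with a data frame in which one variable holds character values that are accented words. It looks like this: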

var <- rep("Meteorol&oacute;gica",5)

I need to convert the HTML-encoded Spanish accents into normal Spanish accented characters. I tried the iconv function:

iconv(var, "UTF-8", "ASCII")

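But it didn't work: I get back the same character vector I passed in. I also tried changing the encoding option of readLines, but that didn't help either.

What should I do? Thanks.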

Answer 1

I don't know R, but if you can include a line of JavaScript in it, this is the line:

var encoded = 'H&oacute;la';
var notEncoded = encoded.replace("&oacute;", "ó");

Then read the value of notEncoded back into your R program.
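
The same single replacement can be done directly in base R, without JavaScript. The following is a minimal sketch, assuming only the `var` vector from the question; `decoded` is a name introduced here for illustration. It also shows why `iconv` appeared to do nothing: the entity `&oacute;` is plain ASCII text, so there are no bytes for `iconv` to convert.

```r
# Sample data from the question: the accent is stored as an HTML entity.
var <- rep("Meteorol&oacute;gica", 5)

# iconv() converts byte encodings, not HTML entities, so an all-ASCII
# input comes back unchanged.
unchanged <- identical(iconv(var, "UTF-8", "ASCII"), var)

# Replace the entity literally (fixed = TRUE disables regex matching).
# "\u00f3" is the Unicode escape for the o-acute character.
decoded <- gsub("&oacute;", "\u00f3", var, fixed = TRUE)
print(decoded)
```

This handles only one entity; text that contains several different entities needs one replacement per entity or a lookup table.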

Answer 2

Why not look up all the HTML &codes; for accented characters and then do a find/replace?

library(rvest)

# scrape lookup table of accented char html codes, from the 2nd table on this page
ref_url <- 'http://www.w3schools.com/charsets/ref_html_8859.asp'
# read_html() replaces html(), which is deprecated in current rvest versions
char_table <- read_html(ref_url) %>% html_table %>% `[[`(2)
# fix names
names(char_table) <- names(char_table) %>% tolower %>% gsub(' ', '_', .)

# here's a test string loaded with different html accents
test_str <- '&Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &ETH; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &times; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &Yacute; &THORN; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &eth; &ntilde; &ograve; &oacute; &ocirc; &otilde; &ouml; &divide; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yacute; &thorn; &yuml;'

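# mgsub is not a base R function; this minimal version follows the approach from
# http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub
# (it's just gsub with a for loop)
mgsub <- function(pattern, replacement, x, ...) {
  for (i in seq_along(pattern)) x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
mgsub(char_table$entity_name, char_table$character, test_str)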

Voilà, there you are:

"À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ"
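
If scraping a reference page at run time is undesirable, the lookup table can be written by hand. The sketch below is one possible variant, not the method used above: `spanish_entities` and `decode_spanish_entities` are hypothetical names, and the table covers only the entities that commonly occur in Spanish text, so any other entity is left untouched.

```r
# Hand-written lookup table: HTML entity name -> character.
# Unicode escapes keep the source file ASCII-only.
spanish_entities <- c(
  "&aacute;" = "\u00e1", "&eacute;" = "\u00e9", "&iacute;" = "\u00ed",
  "&oacute;" = "\u00f3", "&uacute;" = "\u00fa", "&ntilde;" = "\u00f1",
  "&uuml;"   = "\u00fc",
  "&Aacute;" = "\u00c1", "&Eacute;" = "\u00c9", "&Iacute;" = "\u00cd",
  "&Oacute;" = "\u00d3", "&Uacute;" = "\u00da", "&Ntilde;" = "\u00d1",
  "&Uuml;"   = "\u00dc", "&iquest;" = "\u00bf", "&iexcl;"  = "\u00a1"
)

# Apply every replacement in turn; fixed = TRUE matches entities literally.
decode_spanish_entities <- function(x, lookup = spanish_entities) {
  for (entity in names(lookup)) {
    x <- gsub(entity, lookup[[entity]], x, fixed = TRUE)
  }
  x
}

print(decode_spanish_entities(c("Meteorol&oacute;gica", "Espa&ntilde;a")))
```

Because the function is vectorised over `x`, it can be applied directly to a data frame column, for example `df$name <- decode_spanish_entities(df$name)`, where `df` and `name` stand in for the asker's own data.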

Converting HTML-encoded Spanish accents in R
