Question

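My text contains some HTML-escaped characters, e.g. &#39; instead of '. Now I want to unescape these sequences. Since I do not know which characters have been escaped, I do not want to use a simple mapping as in c("&#39;"="'", ...).

As I understand it, the number after the ampersand is the decimal Unicode code point. So &#39; is \u27, since 27 is the hexadecimal representation of 39. So I imagine a solution involving sprintf("\u%x", s), where s is the number extracted between &# and ;. However, this leads to the error: "\u used without hex digits".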

What is a better way to convert HTML escape sequences back to characters?

Answer 1

Just for reference, here is the solution I came up with. It uses the excellent package gsubfn:

library(gsubfn)

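I use the vector htmlchars for the named HTML entities, which I scraped from Wikipedia. For brevity, I do not reproduce the vector in this answer; instead it is sourced from pastebin: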

source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars
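
If the pastebin link is not reachable, a small stand-in vector could look like the sketch below. It is an assumption, not the sourced data: it covers only a handful of entities, and it assumes the names are the bare entity names (amp, not &amp;), because gsubfn passes only the captured group to the replacement function when the pattern contains one.

```r
# Hypothetical stand-in for the vector created by the source() call above.
# Names are bare entity names, values are the characters they represent.
htmlchars <- c(
    amp  = "&",
    lt   = "<",
    gt   = ">",
    quot = "\"",
    apos = "'",
    nbsp = "\u00a0",  # non-breaking space
    copy = "\u00a9",  # copyright sign
    reg  = "\u00ae"   # registered sign
)
```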

Now the decoding function I was looking for is simple:

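strdehtml <- function(s) {
    # Numeric entities such as &#39;: decode the decimal code point.
    # intToUtf8 also handles code points above 255, which rawToChar(as.raw(...)) cannot.
    ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.numeric(x)), s)
    # Named entities such as &amp;: look the captured name up in htmlchars.
    ret <- gsubfn("&([^;]+);", function(x) htmlchars[x], ret)
    return(ret)
}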

I am not sure whether this covers every possible HTML character, but it got me working. For example, it can be used like this:

test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"
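
For the numeric entities alone, a base R sketch without gsubfn may help. The attempt with sprintf("\u%x", s) from the question fails because the \u escape is resolved by the parser when the string literal is read, so it cannot be completed at run time; intToUtf8 does the code-point-to-character conversion at run time instead. The helper name decode_numeric_entities is hypothetical, and handling the hexadecimal form (&#x27;) is an addition that the question does not ask for.

```r
# Decode numeric HTML entities (decimal &#39; and hexadecimal &#x27;) with base R.
# decode_numeric_entities is a hypothetical helper name, not part of any package.
decode_numeric_entities <- function(s) {
    pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
    m <- gregexpr(pattern, s, perl = TRUE)
    # Replace every matched entity with the character for its code point.
    regmatches(s, m) <- lapply(regmatches(s, m), function(ents) {
        vapply(ents, function(e) {
            body <- substr(e, 3, nchar(e) - 1)          # strip "&#" and ";"
            code <- if (grepl("^[xX]", body)) {
                strtoi(substring(body, 2), base = 16L)  # hexadecimal form
            } else {
                strtoi(body, base = 10L)                # decimal form
            }
            intToUtf8(code)
        }, character(1), USE.NAMES = FALSE)
    })
    s
}

decode_numeric_entities("last year&#39;s resolutions")
```

Named entities such as &amp; are left untouched by this sketch; they still need a lookup table like htmlchars.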

Answer 2

You can also call Node.JS from R as a system command. Node.JS has a package that does exactly this. Here are the instructions:

Install Node.JS and NPM: apt-get install nodejs npm
Install the html-entities package: npm install html-entities
From R:

print(system(command = "nodejs -e \"var Entities = require('html-entities').AllHtmlEntities; entities = new Entities(); console.log(entities.decode('&lt;&gt;&quot;&#39;&amp;&copy;&reg;&#8710;'));\""))

Here is the output of the above command:

<>"'&©®∆

Unescape HTML &#nn; sequences
