Question

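I may not have used the right terminology in the title. Feel free to edit it if needed.

I want to take a string in which unicode characters have been replaced by "byte" placeholders and convert those placeholders back to unicode. Say I have: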

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

I would like to get back:

"bißchen Zürcher hello world Æ"

I know that if I could get the string into this form, it would print to the console as desired:

"bi\xdfchen Z\xfcrcher hello world \xc6"

I tried:

gsub("<([[a-z0-9]+)>", "\\x\\1", x)
## [1] "bixdfchen Zxfcrcher hello world xc6"

Answer 1

How about this:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

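# Locate every two-digit hex code wrapped in angle brackets, e.g. "<df>"
m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
# Turn each code into the single byte it names
chars <- lapply(codes, function(code) {
    rawToChar(as.raw(strtoi(substr(code, 2, 3), base = 16L)), multiple = TRUE)
})

# Put the bytes back where the codes were
regmatches(x, m) <- chars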

x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"

Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"

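Note that you cannot build an escape sequence by pasting "\x" in front of the hex digits. That "\x" is not actually in the string; it is only how R chooses to display the byte on screen. Here rawToChar() is used to turn each number into the character we want.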
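
To illustrate the point above, here is a minimal sketch comparing the two approaches by byte count. The variable names `pasted` and `byte` are made up for this example.

```r
# Pasting "\x" in front of the digits gives four literal characters:
# a backslash, "x", "d" and "f".
pasted <- paste0("\\x", "df")
n_pasted <- nchar(pasted, type = "bytes")

# Converting the hex digits to a raw byte gives a one-byte string.
byte <- rawToChar(as.raw(strtoi("df", base = 16L)))
n_byte <- nchar(byte, type = "bytes")

n_pasted  # 4
n_byte    # 1
```

Only the second string contains the byte 0xdf, which is what R displays as "\xdf" when it cannot show the character itself.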

I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. Single bytes like these are not valid UTF-8 on their own.
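
If an actual UTF-8 string is needed rather than a latin1-marked one, one possible follow-up step is to convert with iconv(). This is a sketch that assumes the bytes really are Latin-1; the names `bytes`, `s` and `utf8` are chosen only for illustration.

```r
# Latin-1 bytes for "Zürcher" (0xfc is u-umlaut in Latin-1)
bytes <- as.raw(c(0x5a, 0xfc, 0x72, 0x63, 0x68, 0x65, 0x72))
s <- rawToChar(bytes)
Encoding(s) <- "latin1"

# Re-encode: the single byte 0xfc becomes the two-byte UTF-8 sequence c3 bc
utf8 <- iconv(s, from = "latin1", to = "UTF-8")

Encoding(utf8)
charToRaw(utf8)
```

After the conversion the string no longer depends on the latin1 mark to display correctly in a UTF-8 console.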

Answer 2

You could also use the gsubfn package.

library(gsubfn)
f <- function(x) rawToChar(as.raw(as.integer(paste0("0x", x))), multiple = TRUE)
gsubfn("<([0-9a-f]{2})>", f, "bi<df>chen Z<fc>rcher hello world <c6>")
## [1] "bißchen Zürcher hello world Æ"

Convert byte encoding to unicode