Question

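I have the following list:

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

I want to convert it to a list in which the escaped Unicode characters are replaced by their UTF-8 equivalents, like so:

goal <- list("Chamberlain", "Romañach", "<node>")

The deparsed string is what causes the problem. If the second string were instead:

wouldbenice <- "Roma\u00F1ach"

then enc2native(wouldbenice) would do the right thing (or lapply(x, enc2native) for the whole list).

I can get the second string to display correctly in UTF-8 with:

# displays "Romañach"
eval(parse(text = x[[2]]))

However, this goes badly for x[[1]] and x[[3]] (it throws errors; "<node>", for instance, is not valid R syntax). How can I reliably parse the entire list into the appropriate encoding?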
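To make the failure concrete, here is a minimal sketch that records which elements survive eval(parse()). It uses the three strings under discussion and assumes no variable named Chamberlain exists in the session.

```r
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

# For each element, record whether eval(parse(text = ...)) succeeds
ok <- vapply(x, function(s) {
  tryCatch({
    eval(parse(text = s))
    TRUE
  }, error = function(e) FALSE)
}, logical(1))

ok
```

Only the element that is itself a quoted R string literal evaluates cleanly. The bare word is parsed as a symbol lookup, and "<node>" does not parse at all.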

Answer 1

Use the stringi package.

From stringi, use stri_replace_all_regex to remove the quotes and stri_unescape_unicode to unescape the Unicode symbols.

library(stringi)

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

removed_quotes <- stri_replace_all_regex(x, "\"", "")

unescaped <- stri_unescape_unicode(removed_quotes)

unescaped
# [1] "Chamberlain" "Romañach"    "<node>"
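One caveat: the pattern "\"" removes every double quote, including any that appear inside a string. A minimal sketch of a narrower variant, assuming stringi is installed and that only a single wrapping pair of quotes should be dropped:

```r
library(stringi)

x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

# Coerce the list to a character vector first
chr <- unlist(x)

# Strip one leading and one trailing double quote, leaving interior quotes alone
trimmed <- stri_replace_all_regex(chr, "^\"|\"$", "")

# Turn \uXXXX escapes into the characters they stand for
result <- stri_unescape_unicode(trimmed)

# Compare against the target from the question
identical(as.list(result), list("Chamberlain", "Roma\u00F1ach", "<node>"))
```

This still strips a lone leading or trailing quote from a string that is not fully wrapped, so it is a sketch rather than a complete fix.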

Answer 2

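This meets the goal in base R, but seems less than ideal in other respects. I'm putting it here so readers can compare, though I think the stringi-based solution is probably the way to go.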

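utf8me <- function(x) {
  # Flag strings containing a literal backslash-u (a crude test for an escape)
  i <- grepl("\\u", x, fixed = TRUE)
  # Re-parse each flagged string separately so R interprets the escape
  x[i] <- vapply(x[i], function(s) eval(parse(text = s)), character(1))
  x
}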

lapply(x, utf8me)
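Since eval(parse()) will run whatever code a string contains, a more defensive base R variant may be worth considering. The sketch below is an assumption-laden alternative, not the answer's method: safe_unescape is a hypothetical helper name, and it only handles strings wrapped in double quotes. It avoids eval() altogether by accepting the parse result only when it is a single string constant.

```r
x <- list("Chamberlain", "\"Roma\\u00F1ach\"", "<node>")

# safe_unescape is a hypothetical helper, not part of base R
safe_unescape <- function(s) {
  # Only treat s as deparsed if it is wrapped in double quotes
  if (!grepl("^\".*\"$", s)) return(s)
  # Parse the quoted literal; on any parse error, keep the original string
  tryCatch({
    exprs <- parse(text = s, keep.source = FALSE)
    # Accept only a single string constant, so no code is ever evaluated
    if (length(exprs) == 1L && is.character(exprs[[1]])) exprs[[1]] else s
  }, error = function(e) s)
}

result <- lapply(x, safe_unescape)
```

Elements that are not quoted, or that parse to anything other than one string literal, come back unchanged.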

Converting the encoding of a deparsed string
