Question

I have an R "plugin" that reads a bunch of lines from stdin, parses them, and evaluates them.

...
code <- readLines(f, warn=F)   ## that's where the lines come from...
result <- eval(parse(text=code))
...

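Now, sometimes the system that supplies the lines of code inserts UTF-8 non-breaking spaces (U+00A0 = \xc2\xa0) into the code. parse() chokes on these characters. For example: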

s <- "1 +\xc2\xa03"
s
[1] "1 + 3"   ## looks fine doesn't it? In fact, the Unicode "NON-BREAK SPACE" is there

eval(parse(text=s))
Error in parse(text = s) : <text>:1:4: unexpected input
1: 1 +?
      ^

eval(parse(text=gsub("\xc2\xa0"," ",s)))
[1] 4
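
A minimal sketch of one way to confirm that the non-breaking space really is in the string, using only base R: charToRaw() dumps the underlying bytes, so the c2 a0 pair shows up even though the printed string looks like it contains an ordinary space.

```r
s <- "1 +\xc2\xa03"

# Dump the raw bytes of the string: the NBSP appears as the pair c2 a0
bytes <- charToRaw(s)
print(bytes)
# [1] 31 20 2b c2 a0 33

# An ordinary space would be the single byte 0x20 instead
print(charToRaw("1 + 3"))
# [1] 31 20 2b 20 33
```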

I would like to replace that character with a regular space, and I can do so (though at my own peril, I feel) as shown above:

code <- gsub('\xc2\xa0',' ',code)

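However, this isn't clean, because the byte sequence '\xc2\xa0' could conceivably start matching in the middle of another 2-byte character whose second byte is 0xc2.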

Perhaps a bit better, we can say:

code <- gsub(intToUtf8(0x00a0L),' ',code)

But this doesn't generalize to whole UTF-8 strings.

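Surely there is a better, more expressive way to enter a string that contains some UTF-8 characters? In general, what is the right way to express a UTF-8 string (here, the pattern argument of sub())?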

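Edit: to be clear, I am interested in entering UTF-8 characters in a string by specifying their hex values. Consider the following example (note that "é" is Unicode U+00E9, which is encoded in UTF-8 as 0xc3a9):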

s <- "Cet été."
gsub("té","__",s)
# --> "Cet é__."
# works, but I like to keep my code itself free of UTF-8 literals,
# plus, for the initial question, I really don't want to enter an actual
# UTF-8 "NON BREAKABLE SPACE" in my code as it would be undistinguishable
# from a regular space.

gsub("t\xc3\xa9","__",s)  ## works, but I question how standard and portable
# --> "Cet é__."

gsub("t\\xc3\\xa9","__",s)  ## doesn't work
# --> "Cet été."

gsub("t\x{c3a9}","__",s)  ## would work in Perl, doesn't seem to work in R
# Error: '\x' used without hex digits in character string starting "s\x"

Answer 1

(Earlier material deleted.)

EDIT2：

> s <- '\U00A0'
> s
[1] " "
> code <- gsub(s, '__','\xc2\xa0' )
> code
[1] "__"
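
Building on the escape above, here is a minimal sketch of how it might apply to the two cases in the question. It relies on R's documented `\uXXXX` string escape, which keeps the source file pure ASCII. The `Encoding(code) <- "UTF-8"` line is an assumption: it presumes the lines read from stdin really are UTF-8, which is stated in the question but not checked here.

```r
# The pattern is written with a \u escape, so no literal NBSP hides in the source
nbsp <- "\u00a0"

code <- "1 +\xc2\xa03"       # bytes as they might arrive from stdin
Encoding(code) <- "UTF-8"    # assumption: the input is known to be UTF-8

# fixed = TRUE treats the pattern as a literal string, not a regex
clean <- gsub(nbsp, " ", code, fixed = TRUE)
result <- eval(parse(text = clean))
print(result)
# [1] 4

# The same escape works for the "Cet été." example from the question
s <- "Cet \u00e9t\u00e9."
out <- gsub("t\u00e9", "__", s)
print(out)
# [1] "Cet é__."
```

Because the escape names the code point rather than its bytes, the pattern is matched character by character instead of byte by byte.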
