Replacing UTF-8 characters in R

Time: 2013-02-06 02:13:23

Tags: r utf-8

I have an R "plugin" that reads a bunch of lines from stdin, parses them, and evaluates them.

...
code <- readLines(f, warn = FALSE)   ## that's where the lines come from...
result <- eval(parse(text=code))
...

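Now, sometimes the system that supplies the lines of code inserts UTF-8 non-breaking spaces (U+00A0 = \xc2\xa0) into the code. parse() chokes on these characters. For example: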

s <- "1 +\xc2\xa03"
s
[1] "1 + 3"   ## looks fine doesn't it? In fact, the Unicode "NON-BREAK SPACE" is there

eval(parse(text=s))
Error in parse(text = s) : <text>:1:4: unexpected input
1: 1 +?
      ^

eval(parse(text=gsub("\xc2\xa0"," ",s)))
[1] 4
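
One way to make the hidden character visible is to inspect the raw bytes or the code points of the string. The following is a minimal sketch using base R's charToRaw() and utf8ToInt(); the explicit Encoding() mark and the variable names bytes and cps are assumptions added here so the example does not depend on the session locale.

```r
s <- "1 +\xc2\xa03"
Encoding(s) <- "UTF-8"   # assumption: declare the bytes as UTF-8 explicitly

bytes <- charToRaw(s)    # raw bytes of the string
bytes
# 31 20 2b c2 a0 33      (c2 a0 is the UTF-8 encoding of U+00A0)

cps <- utf8ToInt(s)      # Unicode code points, one per character
cps
# 49 32 43 160 51        (160 = 0xA0, the non-breaking space)
```

The regular space after "1" shows up as byte 20, while the character that merely looks like a space shows up as the two bytes c2 a0.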

I would like to replace that character with a regular space, and I can do so (though at my own peril) as shown above:

code <- gsub('\xc2\xa0',' ',code)

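However, this is not clean, because the byte sequence '\xc2\xa0' could conceivably start matching in the middle of another 2-byte character whose second byte is 0xc2.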

Somewhat better, perhaps, we can say:

code <- gsub(intToUtf8(0x00a0L),' ',code)

But this does not generalize to UTF-8 strings.

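Surely there is a better, more expressive way to enter a string that contains some UTF-8 characters? In general, what is the right way to express a UTF-8 string (here, the pattern argument of sub())?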


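Edit: to be clear, I am interested in entering UTF-8 characters in a string by specifying their hex values. Consider the following example (note that "é" is Unicode U+00E9, which is encoded in UTF-8 as the bytes 0xc3 0xa9):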

s <- "Cet été."
gsub("té","__",s)
# --> "Cet é__."
# works, but I like to keep my code itself free of UTF-8 literals;
# plus, for the initial question, I really don't want to enter an actual
# UTF-8 "NO-BREAK SPACE" in my code, as it would be indistinguishable
# from a regular space.

gsub("t\xc3\xa9","__",s)  ## works, but I question how standard and portable it is
# --> "Cet é__."

gsub("t\\xc3\\xa9","__",s)  ## doesn't work
# --> "Cet été."

gsub("t\x{c3a9}","__",s)  ## would work in Perl, doesn't seem to work in R
# Error: '\x' used without hex digits in character string starting "t\x"

1 answer:

Answer 0 (score: 2)

(Earlier material deleted.)

EDIT2:

> s <- '\U00A0'
> s
[1] " "
> code <- gsub(s, '__','\xc2\xa0' )
> code
[1] "__"
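
Applied to the question's two examples, the same idea might look like the sketch below. It assumes the incoming lines really are UTF-8; the names nbsp, clean, and out are hypothetical, and fixed = TRUE is a choice made here so the pattern is treated as a literal string rather than a regular expression.

```r
# U+00A0 written with R's \u escape, so no invisible literal in the source
nbsp <- "\u00a0"

# The question's input line, with the non-breaking space given as raw bytes
code <- "1 +\xc2\xa03"
Encoding(code) <- "UTF-8"   # assumption: the input lines are UTF-8

# Replace the non-breaking space with a regular space, then parse and evaluate
clean <- gsub(nbsp, " ", code, fixed = TRUE)
result <- eval(parse(text = clean))
result
# [1] 4

# The "été" example from the question's edit, with no UTF-8 literals in the code
s <- "Cet \u00e9t\u00e9."
out <- gsub("t\u00e9", "__", s, fixed = TRUE)
```

Because the \u and \U escapes name a code point rather than a byte sequence, the pattern is matched character by character, which sidesteps the concern about matching in the middle of another multi-byte character.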