I have an R "plugin" that reads a bunch of lines from stdin, parses them, and evaluates the result.
...
code <- readLines(f, warn=F) ## that's where the lines come from...
result <- eval(parse(text=code))
...
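Now, the system that supplies the lines of code sometimes inserts UTF-8 non-breaking spaces (U+00A0 = \xc2\xa0) into the code, and parse() chokes on these characters. For example: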
s <- "1 +\xc2\xa03"
s
[1] "1 + 3" ## looks fine doesn't it? In fact, the Unicode "NON-BREAK SPACE" is there
eval(parse(text=s))
Error in parse(text = s) : <text>:1:4: unexpected input
1: 1 +?
^
eval(parse(text=gsub("\xc2\xa0"," ",s)))
[1] 4
I would like to replace that character with a regular space, and I can do so (at my own peril) as shown above:
code <- gsub('\xc2\xa0',' ',code)
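However, this is not clean, because the byte sequence '\xc2\xa0' could conceivably start matching in the middle of another 2-byte character whose second byte is 0xc2.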
Perhaps a little better, we can say:
code <- gsub(intToUtf8(0x00a0L),' ',code)
But this does not generalize to arbitrary UTF-8 strings.
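That said, intToUtf8() also accepts a vector of code points, so a longer pattern can be spelled out this way, if rather verbosely. A minimal sketch, where the names pattern, s and result and the sample sentence are only illustrative:

```r
# Build the two-character pattern "t" + U+00E9 from its code points.
# intToUtf8() with the default multiple = FALSE returns a single string.
pattern <- intToUtf8(c(0x74, 0xe9))

# Sample text, written with \u escapes so the source stays pure ASCII.
s <- "Cet \u00e9t\u00e9."

# Replace the pattern with two underscores.
result <- gsub(pattern, "__", s)
print(result)
```

This works, but every character has to be looked up and typed as a number, which is hardly expressive.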
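Surely there is a better, more expressive way to enter a string that contains some UTF-8 characters? In general, what is the right way to express a UTF-8 string (here, the pattern argument of sub())?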
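Edit: To be clear, I am interested in entering UTF-8 characters in a string by specifying their hexadecimal values. Consider the following example (note that "é" is Unicode U+00E9, which can be expressed in UTF-8 as 0xc3a9):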
s <- "Cet été."
gsub("té","__",s)
# --> "Cet é__."
# works, but I like to keep my code itself free of UTF-8 literals,
# plus, for the initial question, I really don't want to enter an actual
# UTF-8 "NON BREAKABLE SPACE" in my code as it would be undistinguishable
# from a regular space.
gsub("t\xc3\xa9","__",s) ## works, but I question how standard and portable it is
# --> "Cet é__."
gsub("t\\xc3\\xa9","__",s) ## doesn't work
# --> "Cet été."
gsub("t\x{c3a9}","__",s) ## would work in Perl, doesn't seem to work in R
# Error: '\x' used without hex digits in character string starting "s\x"
Answer 0 (score: 2)
(Earlier material deleted.)
EDIT2:
> s <- '\U00A0'
> s
[1] " "
> code <- gsub(s, '__','\xc2\xa0' )
> code
[1] "__"
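The same escape can be applied to both examples from the question. The following is only a sketch: it assumes that R stores strings written with \u escapes as UTF-8, and the variable names (nbsp, code, cleaned, value, replaced) and the sample input are made up for illustration.

```r
# "\u00a0" names the code point U+00A0 (NO-BREAK SPACE); R is assumed to
# store the resulting string as UTF-8.
nbsp <- "\u00a0"
nbsp_bytes <- charToRaw(nbsp)   # the raw bytes behind the escape

# Made-up input standing in for one line read from stdin.
code <- "1 +\u00a03"

# fixed = TRUE treats the pattern as a literal string, not a regex.
cleaned <- gsub(nbsp, " ", code, fixed = TRUE)
value <- eval(parse(text = cleaned))
print(value)

# The "Cet \u00e9t\u00e9." example with the pattern written as an escape,
# so no non-ASCII literal appears in the source file.
replaced <- gsub("t\u00e9", "__", "Cet \u00e9t\u00e9.")
print(replaced)
```

Because the pattern is given as a code point rather than as raw bytes, the source stays pure ASCII and the no-break space cannot be confused with a regular space when reading the code. If the session does not run in a UTF-8 locale, the incoming lines may also need to be declared as UTF-8, for instance with readLines(f, encoding = "UTF-8").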