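I am trying to use readLines(..., encoding='UTF-8')
to clean up some text that I have loaded into memory.
If I don't specify the encoding, I see all sorts of strange characters, such as: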
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the Unicode escapes at the end: \u009f, \u0098, and so on.
I can't find the right command and regular expression to get rid of these. I have tried:
gsub('[^[:punct:][:alnum:][\\s]]', '', text)
I also tried specifying the Unicode characters directly, but I believe they are interpreted as literal text:
gsub('\u009', '', text) # Unchanged
Answer 0: (score: 5)
The simplest way to get rid of these characters is to convert from UTF-8 to ASCII:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
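As a minimal sketch of how the `sub` argument behaves, the snippet below runs `iconv()` on a small made-up vector (`x` and its contents are hypothetical, not the asker's data). It assumes the input really is UTF-8; exact behaviour can vary with the platform's iconv implementation.

```r
# Hypothetical sample input: ASCII text mixed with non-ASCII characters.
x <- c("cray cray \u263a", "plain text", "caf\u00e9")

# sub = "" drops every byte that cannot be represented in ASCII.
# Without sub (the default NA), iconv() returns NA for any element
# that contains a non-convertible character.
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(cleaned)

# sub = "byte" leaves a visible trace of what was removed, written as
# <xx> hex escapes, which can help when checking what is being dropped.
print(iconv(x, from = "UTF-8", to = "ASCII", sub = "byte"))
```

Note that this removes all non-ASCII text, including legitimate accented letters, so it only fits cases where plain ASCII output is acceptable.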
Answer 1: (score: 2)
If you want to use a regular expression, you can keep only the characters you want by specifying a range of ASCII codes:
text = "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"
gsub('[^\x20-\x7E]', '', text)
# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"
For reference, see the ASCII code table at asciitable.com.
As you can see, I removed every character outside the range x20 (SPACE) to x7E (~).
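The same idea can be wrapped in a small helper. This is only a sketch: `strip_non_ascii` and `sample_text` are hypothetical names introduced for illustration, and the sample string is made up.

```r
# Hypothetical helper: keep printable ASCII only (space through tilde).
strip_non_ascii <- function(x) {
  # R's parser turns "\x20" and "\x7E" into " " and "~" before the regex
  # engine sees them, so this pattern is equivalent to "[^ -~]".
  gsub("[^\x20-\x7E]", "", x)
}

# Made-up input containing an emoji-like symbol, an accented letter,
# a tab, and a trailing newline.
sample_text <- "leave it at that \u263a caf\u00e9\tend\n"
result <- strip_non_ascii(sample_text)
print(result)
```

Because the range starts at x20, control characters such as tabs and newlines are removed too. If those should survive, one option is to add them to the character class, for example "[^\x20-\x7E\t\n]".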