Question

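I am trying to clean up some text that I loaded into memory with readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kinds of strange characters, such as: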

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> ðŸ˜œðŸ˜â˜º'"

This is what it looks like after readLines(..., encoding='UTF-8'):

> "The way I talk to my family......i would get my ass beat to
> DEATH....but they  kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"

You can see the Unicode escapes at the end: \u009f, \u0098, and so on.

I can't find the right command and regular expression to get rid of these. I have tried:

gsub('[^[:punct:][:alnum:][\\s]]', '', text)

I have also tried specifying the Unicode characters directly, but I believe they are interpreted as literal text:

gsub('\u009', '', text) # Unchanged

Answer 1

The easiest way to get rid of these characters is to convert from UTF-8 to ASCII:

combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
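
As a minimal sketch of how that call behaves, the snippet below runs it on a made-up sample string (the sample text and the `ascii_only` variable are assumptions for illustration, not from the question). With `sub = ''`, bytes that cannot be represented in ASCII are dropped; `sub = 'byte'` is a documented alternative that leaves a visible `<xx>` marker for each such byte. The exact result can depend on the platform's iconv implementation.

```r
# Hypothetical sample input: ASCII text mixed with two non-ASCII characters,
# written with \u escapes so the script itself stays plain ASCII.
combined_doc <- "cray cray \u263a caf\u00e9"

# sub = "" removes every byte that has no ASCII representation.
ascii_only <- iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_only)

# sub = "byte" keeps a <xx> hex marker per non-convertible byte instead,
# which is useful for checking what would be removed.
print(iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "byte"))
```

Without `sub`, iconv() returns NA for any element that cannot be converted, so the argument matters here.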

Answer 2

If you want to use a regular expression, you can keep only the characters you want by specifying a range of ASCII codes:

text = "The way I talk to my family......i would get my ass beat to 
DEATH....but they kno I cray cray & just leave it at that ðŸ˜œðŸ˜â˜º'"

gsub('[^\x20-\x7E]', '', text)

# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"

The ASCII code table at asciitable.com lists these codes.

As the table shows, I removed every character that is not in the range x20 (SPACE) to x7E (~).
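
One side effect of the `[^\x20-\x7E]` class is that it also removes control characters such as tabs and newlines, which is why the line break in the example output above disappears. The following sketch, using a made-up sample string and hypothetical variable names (`strict`, `keep_ws`), compares the strict pattern with a variant that keeps tabs and newlines:

```r
# Hypothetical sample: printable ASCII, an embedded newline, and one
# non-ASCII character (U+263A) written as an escape.
text <- "they kno I cray cray\n& just leave it at that \u263a'"

# Strict version: keeps only printable ASCII (space through tilde),
# so the newline is removed along with the non-ASCII character.
strict <- gsub("[^\x20-\x7E]", "", text)

# Variant: also keeps tab and newline by listing them in the class.
keep_ws <- gsub("[^\x20-\x7E\t\n]", "", text)

print(strict)
print(keep_ws)
```

Which variant is appropriate depends on whether line structure matters for the later processing steps.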

How do I remove strange characters using gsub in R?
