Question

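I am working on a text mining project using the seven Harry Potter books. There is an R package that contains the text of the books. In this package, each book is a vector, and each chapter is a string within that vector.

While preparing the strings for my analysis, I keep running into some whitespace that I can't identify and can't figure out how to remove. The following code illustrates the problem: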

require(devtools)
devtools::install_github("bradleyboehmke/harrypotter")
require(harrypotter)

temp <- substr(philosophers_stone[1], 0, 31)
temp

temp <- gsub(" ", "", temp)
temp

temp <- gsub("[\t\n\r\v\f]", "", temp)
temp

temp <- gsub("&nbsp;", "", temp)
temp

The output of the code is as follows:

temp <- substr(philosophers_stone[1], 0, 31)
temp
# [1] "THE BOY WHO LIVED　　Mr. and Mrs."
temp <- gsub(" ", "", temp)
temp
# [1] "THEBOYWHOLIVED　　Mr.andMrs."
temp <- gsub("[\t\n\r\v\f]", "", temp)
temp
# [1] "THEBOYWHOLIVED　　Mr.andMrs."
temp <- gsub("&nbsp;", "", temp)
temp
# [1] "THEBOYWHOLIVED　　Mr.andMrs."

Can anyone help me figure out what this character is, and how I can get rid of it?

Answer 1

Use charToRaw:

charToRaw(temp)

#  [1] 54 48 45 20 42 4f 59 20 57 48 4f 20 4c 49 56 45 44 e3 80 80 e3 80 80 4d 72 2e 20 61 6e 64 20 4d
# [33] 72 73 2e

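Each element here is one byte, which for plain ASCII text corresponds to one character. We can infer that the troublesome whitespace is e3 80 80 (repeated twice). That byte sequence is the UTF-8 encoding of U+3000, the "ideographic space", a full-width space (commonly used in fixed-width scripts such as Chinese or Japanese).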
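To make that inference less manual, the code points can be listed directly. This is a minimal sketch, assuming a stand-in string `s` that mirrors the 31-character excerpt from the question (the real text comes from the harrypotter package); the variable names are illustrative only.

```r
# Stand-in for the excerpt in the question (assumed equivalent to `temp`).
s <- "THE BOY WHO LIVED\u3000\u3000Mr. and Mrs."

# One integer code point per character; anything above 127 is non-ASCII.
cps <- utf8ToInt(s)
odd <- unique(cps[cps > 127])

# Format the suspicious code points in the usual U+XXXX notation.
odd_labels <- sprintf("U+%04X", odd)
odd_labels

# Their UTF-8 bytes, for comparison with the charToRaw() output above.
odd_bytes <- charToRaw(intToUtf8(odd))
odd_bytes
```

Looking the reported code point up in a Unicode chart then identifies the character by name.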

In any case, now we can convert the bytes back with rawToChar and use the result in gsub:

gsub(rawToChar(as.raw(c('0xe3', '0x80', '0x80'))), '', temp)
# [1] "THE BOY WHO LIVEDMr. and Mrs."

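(Adding fixed = TRUE would speed this up, but that hardly matters here, since you are going to strip all the other whitespace as well.)

FWIW, simply using \s also works for me (as does Richard Scriven's other suggestion, [[:space:]]):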
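As an alternative to building the pattern from raw bytes, the character can be written with a Unicode escape. This is a sketch on a stand-in string `s` (assumed to match the excerpt from the question), not a change to the approach above.

```r
# Stand-in for the excerpt in the question (assumed equivalent to `temp`).
s <- "THE BOY WHO LIVED\u3000\u3000Mr. and Mrs."

# "\u3000" is the ideographic space; fixed = TRUE treats it as a literal string.
cleaned <- gsub("\u3000", "", s, fixed = TRUE)
cleaned
```

Only the ideographic spaces are removed here; the ordinary spaces are left in place.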

gsub('\\s', '', temp)
# [1] "THEBOYWHOLIVEDMr.andMrs."

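My guess is that \s does not work for you because of a locale or platform issue; from ?regex:

[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters. [emphasis mine, on the final clause]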
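If the locale is the culprit, one way to sidestep it may be a Unicode property class with perl = TRUE. This is a hedged sketch: it assumes the PCRE library behind your R build supports Unicode properties and that the string is marked as UTF-8, as the stand-in `s` below is.

```r
# Stand-in for the excerpt in the question (assumed equivalent to `temp`).
s <- "THE BOY WHO LIVED\u3000\u3000Mr. and Mrs."

# \s covers the usual ASCII whitespace; \p{Z} covers Unicode separator
# characters, which include U+3000 (category Zs).
stripped <- gsub("[\\s\\p{Z}]", "", s, perl = TRUE)
stripped
```

Because the character class names the Unicode category explicitly, the result should not depend on which characters the current locale counts as whitespace.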

Answer 2

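Strange. I'm not sure how that whitespace is defined. However, you could try storing the odd whitespace (characters 18 and 19 in the example you provided) in a variable and then using it as the pattern to replace in the text: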

require(devtools)
devtools::install_github("bradleyboehmke/harrypotter")
require(harrypotter)

temp <- substr(philosophers_stone[1], 0, 31)
x <- substr(temp, 18, 19)
temp <- gsub(x, "", temp)
temp <- gsub(" ", "", temp)
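A small variation on the same idea, sketched on a stand-in string `s` (assumed to match the excerpt from the question): extracting a single character instead of the pair means isolated occurrences elsewhere in the text are removed too, since a two-character pattern only matches two of them in a row.

```r
# Stand-in for the excerpt in the question (assumed equivalent to `temp`).
s <- "THE BOY WHO LIVED\u3000\u3000Mr. and Mrs."

# Take just one of the odd characters (position 18) and inspect its code point.
x <- substr(s, 18, 18)
utf8ToInt(x)

# Remove every occurrence of that character, then the ordinary spaces.
no_wide <- gsub(x, "", s, fixed = TRUE)
no_space <- gsub(" ", "", no_wide, fixed = TRUE)
no_space
```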
