Question

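I am doing some text mining that involves Portuguese text. Some of my custom text-mining functions also contain other special characters.

I am not an expert on this topic. When many of my characters started displaying incorrectly, I assumed I needed to change the file encoding. I tried: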

ISO-8859-1
ISO-8859-7
UTF-8
WINDOWS-1252

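None of them improved how the characters are displayed. Do I need a different encoding, or am I going about this completely wrong?

For example, when I try to read this stop-word list from GitHub: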

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")

the words come out like this:

tail(stop_words, 17)

206    tivÃ©ramos
207         tenha
208      tenhamos
209        tenham
210       tivesse
211   tivÃ©ssemos
212      tivessem
213         tiver
214      tivermos
215       tiverem
216         terei
217         terÃ¡
218       teremos
219        terÃ£o
220         teria
221     terÃamos
222        teriam

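I have also tried stringsAsFactors = F.

I don't speak Portuguese, but my intuition tells me that the euro and copyright symbols are not part of its alphabet. It also seems to be turning some accented lowercase letters into various uppercase A's.

In case it is useful: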

Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I also tried changing the locale, stri_encode(stop_words$V1, "", "UTF-8"), and tail(enc2native(as.vector(stop_words[,1])),17).

Answer 1

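This looks like a case of UTF-8 double encoding.

Here is a chart of characters in UTF-8: http://www.i18nqa.com/debug/utf8-debug.html. Look at the "Actual" column.

As you can see, the characters being printed seem to represent the actual values rather than the encoded values.

A temporary fix is to decode one layer of UTF-8.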
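As a rough illustration of what "decoding one layer" could look like in R, the sketch below first reproduces the garbling on a single word and then reverses it. It assumes the damage comes from UTF-8 bytes being read as Windows-1252, which matches the locale shown in the question but is not confirmed there. The variable names are made up for the example.

```r
# One accented stop word from the question, written with an escape so the
# script itself does not depend on the file's encoding.
original <- "tiv\u00e9ramos"

# Simulate the damage (assumption): treat the UTF-8 bytes as Windows-1252
# and convert them to UTF-8 again. The two bytes of the accented letter
# turn into two separate characters.
garbled <- iconv(original, from = "WINDOWS-1252", to = "UTF-8")
print(garbled)

# Undo one layer: convert back to Windows-1252 bytes, then mark those
# bytes as the UTF-8 they really are.
repaired <- iconv(garbled, from = "UTF-8", to = "WINDOWS-1252")
Encoding(repaired) <- "UTF-8"
print(repaired)
print(identical(repaired, original))
```

This only helps when every garbled byte has a Windows-1252 mapping, so it is a workaround rather than a substitute for reading the file with the right encoding in the first place.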

Update

After installing R, I tried to reproduce the problem. Here is my console log, with a brief explanation.

First, I copied and pasted your code:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃamos
222      teriam

OK, so it does not work as is. I then added the encoding argument at the end of the read.table call. When I tried lowercase utf-8, this was the result:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",encoding="utf-8")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃamos
222      teriam

Finally, I used UTF-8 in uppercase, and now it works correctly:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt", encoding = "UTF-8")
> tail(stop_words, 17)
            V1
206  tivéramos
207      tenha
208   tenhamos
209     tenham
210    tivesse
211 tivéssemos
212   tivessem
213      tiver
214   tivermos
215    tiverem
216      terei
217       terá
218    teremos
219      terão
220      teria
221   teríamos
222     teriam

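You probably forgot to put the encoding argument at the end of read.table, or you tried lowercase instead of uppercase. What I understand from this is that R tries to convert the characters to UTF-8 unless you tell it that they are already encoded in UTF-8.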
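One possible reading of the lowercase/uppercase difference, offered as an assumption rather than something verified in this thread: the R documentation describes the encoding argument as a way to mark strings as Latin-1 or UTF-8, not as a re-encoding step, so a spelling other than "UTF-8" or "latin1" may simply go unrecognized. The sketch below avoids the network by writing a few of the stop words to a temporary file as UTF-8 and reading them back. The file and variable names are hypothetical.

```r
# Write three accented stop words to a temporary file as raw UTF-8 bytes.
words <- c("tiv\u00e9ramos", "ter\u00e1", "ter\u00e3o")
path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")
writeLines(enc2utf8(words), con, useBytes = TRUE)
close(con)

# Read the file back, declaring that its strings are UTF-8.
marked <- read.table(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect how R has marked each string and whether the text survived.
print(Encoding(marked$V1))
print(identical(marked$V1, words))

unlink(path)
```

On a system whose native encoding is already UTF-8 the lowercase spelling would likely appear to work too, so the difference is mainly visible on locales such as the Windows-1252 one in the question. read.table also accepts a fileEncoding argument, which re-encodes the input instead of only marking it.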

Answer 2

I am Portuguese, and I ran into the same problem even though my encoding is

Sys.getlocale()
[1] "LC_COLLATE=Portuguese_Portugal.1252;LC_CTYPE=Portuguese_Portugal.1252;LC_MONETARY=Portuguese_Portugal.1252;LC_NUMERIC=C;LC_TIME=Portuguese_Portugal.1252"

So I searched online and found this tip on SO.

stop_words2 <- sapply(stop_words, as.character)

It works. But I read the data with read.table(..., stringsAsFactors = FALSE).
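For readers unsure what that sapply call produces, here is a small sketch on a made-up two-row data frame. It only shows the change of type (factor column in, character matrix out). Whether this alone repairs the display is not established here and may depend on the locale.

```r
# Hypothetical stand-in for the stop-word data frame, with a factor column
# as older read.table defaults would produce.
stop_words <- data.frame(V1 = c("tenha", "ter\u00e1"), stringsAsFactors = TRUE)

# Convert every column to character; sapply simplifies the result to a matrix.
stop_words2 <- sapply(stop_words, as.character)

print(class(stop_words$V1))   # type of the original column
print(is.matrix(stop_words2)) # the result is no longer a data frame
print(dim(stop_words2))       # one row per word, one column per variable
```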

Why don't these different encodings let me display Portuguese correctly?
