Question

I'm struggling with some encoding problems. I have a number of text files containing lines in the following format:

https://dl.dropboxusercontent.com/u/94114397/example.txt

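According to Notepad++, these files are encoded in UTF-8, and most non-ASCII characters display correctly, as in lines 1 and 2. However, some characters seem to be misinterpreted (?). In my example file, this happens in line 3 in the word "Lakuic", where there should be an "š" between the "u" and the "i". There actually is a character between these two letters, which you can see by copy-pasting the word into the Google Chrome address bar.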

Now, when I read the data into R, it shows "Laku<U+009A>ic". How can I fix this?

Answer 1

Try converting from UTF-8 to latin1:

    df <- read.table("http://dl.dropboxusercontent.com/u/94114397/example.txt", sep = "\t", row.names = 1, stringsAsFactors = FALSE, encoding="UTF-8")
    iconv(df[, 1], from = "UTF-8", to = "latin1")
    # [1] "Trichocentrum<->longifolium<-><->(Lindl.) R.Jiménez, Acta Bot. Mex. 97: 54 (2011)." 
    # [2] "Salvia<->× hegelmaieri<->nothosubsp. accidentalis<->(Sánchez-Gómez & R.Morales)."   
    # [3] "Edraianthus<->tarae<-><->Lakušic, Bilten Drustva Ekologa BiH, Ser. A 4: 108 (1987)."

My sessionInfo():

    # Platform: x86_64-w64-mingw32/x64 (64-bit)
    # Running under: Windows 7 x64 (build 7601) Service Pack 1
    #
    # locale:
    #   [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                    LC_TIME=German_Germany.1252
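
One plausible reading of why this works (an assumption, not something the answer states): U+009A is a C1 control character, while byte 0x9A is "š" in Windows-1252. Converting to latin1 turns U+009A back into the single byte 0x9A, which a Windows-1252 locale like the one above then displays as "š". On other platforms that reinterpretation may need to be spelled out. A minimal sketch, where the sample string `x` is made up for illustration and the encoding name "CP1252" is assumed to be known to the local iconv:

```r
# Hypothetical sample: "Laku" + U+009A + "ic", as R reports it after reading the file.
x <- "Laku\u009aic"

# Step 1: UTF-8 -> latin1 maps U+009A back to the single byte 0x9A.
latin1_bytes <- iconv(x, from = "UTF-8", to = "latin1")

# Step 2: reinterpret that byte as Windows-1252, where 0x9A is U+0161 (s with caron),
# and convert the result back to UTF-8.
# Assumption: the iconv build on this machine recognises the name "CP1252".
fixed <- iconv(latin1_bytes, from = "CP1252", to = "UTF-8")

print(fixed)
```

Applied to the data frame from the answer, the same two steps would be `iconv(iconv(df[, 1], "UTF-8", "latin1"), "CP1252", "UTF-8")`.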

Answer 2

This worked for me:

    FB.ui({
      method: 'feed',
      link: 'https://developers.facebook.com/docs/',
      caption: 'An example caption',
    }, function(response){});

R encoding UTF-8: U+0080-U+009F
