Question

I'm having trouble reading a file that contains uncommon characters, in this case an arrow symbol. I tried specifying the input file encoding, for example:

> scan('SMKA121212' , what="", sep="\n", blank.lines.skip=T, fileEncoding="UTF-8")
Read 13 items
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection 'SMKA121212'

> scan('SMKA121212', what="", sep="\n", blank.lines.skip=FALSE, encoding="UTF-8")
Read 1724 items

(There are actually more than 10k lines, and the read breaks at the arrow character.)

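I'm a little unclear on the difference between encoding and fileEncoding, in terms of how R responds to characters it doesn't expect. A clarification would be useful.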

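I'd appreciate any suggestions on how to force R to read the full document, perhaps by simply ignoring characters that don't conform to the system encoding.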

Answer 1

What I see in a text editor is that "|" is used as the separator rather than "\n", and that line 1724 contains this sequence:

kibbled [ìGrtzeî or ìgruttenî], pearl...

Two different accented characters appear to enclose "Grtze" and "grutten", but the character you are seeing is not displayed.

When I read it on a Mac with:

read.table("~/Downloads/lines/1720-1730.txt", sep="|")

the problematic characters show up like this:

[\x93Gr\032tze\x94 or \x93grutten\x94]

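So the 'arrow' you are seeing is \032, the ASCII SUB control character. I find it difficult to decipher what the various escapes in R's output mean. The best place to look is the ?Quotes help page, which tells us this is an octal escape: 32 octal, or 26 decimal. You may want to try this as an input strategy and see how it goes: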

x <- read.table("yourpath/filename.txt", sep="|", stringsAsFactors=FALSE, allowEscapes = TRUE)

If that's not enough, try adding an encoding option such as "latin1", "UTF-8" or "UTF-16". If none of those succeeds, there are other Windows encodings left to try.
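If neither escapes nor an encoding option gets past the \032 byte, another option is to read the file as raw bytes and replace that byte before parsing. One possible explanation for the truncated read, which nothing in this thread confirms, is that Windows text-mode reads can treat byte 26 (Ctrl-Z) as end of file. The following is a minimal sketch on a made-up two-record sample; the sample data and the choice of a space as the replacement are assumptions, not taken from the real file.

```r
# Hypothetical sample: two pipe-delimited records, the first containing
# the SUB byte (octal 32, decimal 26) that shows up as \032 above.
sub_byte <- as.raw(strtoi("32", base = 8L))
path <- tempfile()
writeBin(c(charToRaw("1|Gr"), sub_byte, charToRaw("tze|KG\n2|grutten|NO\n")), path)

# Read the whole file as raw bytes, bypassing any text-mode handling.
bytes <- readBin(path, what = "raw", n = file.info(path)$size)

# Replace every SUB byte with a space (an arbitrary placeholder).
bytes[bytes == sub_byte] <- charToRaw(" ")

# Parse the cleaned text with the same separator and quoting options.
dat <- read.table(text = rawToChar(bytes), sep = "|", quote = "",
                  comment.char = "", stringsAsFactors = FALSE)
dat$V2
```

For the real file you would point readBin at your own path instead of the temporary file.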

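When you get a message reporting fewer elements than expected, it usually means there are mismatched quotes or embedded hash marks ("#"). You can add these arguments: quote="", comment.char="". To see the effect of these additional arguments, you can use: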

 table(count.fields("yourpath/filename.txt", sep="|",
         quote="", comment.char=""))

A further check lets you see which lines are causing the problem:

 which(count.fields("yourpath/filename.txt", sep="|",
         quote="", comment.char="") == 28)
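To make the two checks concrete, here is a small sketch on a made-up file. The file contents are hypothetical; only count.fields, table and which are the real functions discussed above. The last line deliberately has one field too many and the second contains a hash mark.

```r
# Hypothetical sample: three fields per line, except the last line (four).
path <- tempfile()
writeLines(c("a|b|c", "g|h#i|j", "k|l|m|n"), path)

# Treat quotes and "#" as literal text so they cannot hide separators.
n <- count.fields(path, sep = "|", quote = "", comment.char = "")

table(n)       # how many lines have each field count
which(n != 3)  # line numbers whose field count is not the expected 3
```

Running count.fields once and storing the result avoids scanning a large file twice.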

There may be a mismatch between your locale and the default encoding. You should report the results of sessionInfo().

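Encodings I've seen mentioned as fixes for odd problems include "CP1252" and "Latin2" (which is ISO-8859-2), but the list of available encodings turned out to be longer than I expected: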

 iconvlist()  # 419 encodings
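Rather than working through that whole list, one rough way to narrow it down is to test a few candidates against the problem bytes with iconv, which returns NA when the input is not valid in the declared encoding. This is only a sketch: the shortlist is an assumption, and which encoding names are available depends on the platform. The bytes \x93 and \x94 are the ones CP1252 uses for curly double quotes, so CP1252 is a plausible guess, though nothing here confirms it.

```r
# Bytes as shown in the escaped output above: \x93 G r \032 t z e \x94
x <- rawToChar(as.raw(c(0x93, 0x47, 0x72, 0x1a, 0x74, 0x7a, 0x65, 0x94)))

# Assumed shortlist of encodings to test; extend it from iconvlist().
candidates <- c("UTF-8", "CP1252", "latin1")
converted <- sapply(candidates, function(enc) iconv(x, from = enc, to = "UTF-8"))

# TRUE marks encodings under which these bytes are not valid input.
is.na(converted)
```

A non-NA result only shows that the bytes are legal in that encoding, not that it is the right one, so the converted text still needs checking by eye.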

If you know which organization created the file, why not ask them?

Starting with the first of the several zip files contained in the "master" zip file, this is what the suggested count.fields check shows:

table( count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote="", 
      sep="|",comment.char="") )
#------------
   15    27    28 
    1 10228     1 
which( count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote="", sep="|",comment.char="") ==15)
#[1] 1
which( count.fields("~/Downloads/SMKA12_2012archive/SMKA121212", quote="", sep="|",comment.char="") ==28)
#[1] 10230

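I read these files on a Mac with R 3.0.1 and TextEdit.app. The first record appears not to be a header but rather a notation, possibly indicating the month the data were recorded: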

<00> 000000000 ||||||||||||||||||||||||||| HMCUSTOMS CONTROL DATA | 2012 | 12

The last line is a non-data trailer record with the final record count attached to it:

999999999 | | | | | | | | | | | | | | | | | | | | | | | | | | | 0010228

So using skip=1 and fill=TRUE should allow error-free input.

dat <- read.table("~/Downloads/SMKA12_2012archive/SMKA121212", quote="", sep="|",comment.char="", fill=TRUE, skip=1 , colClasses=c( rep("integer", 2), rep("character", 4), rep("integer", 24-7+1), rep("character", 3)))
> str(dat)
'data.frame':   10230 obs. of  27 variables:
 $ V1 : int  10110100 10110900 10121000 10129100 10129900 10130000 10190000 10190110 10190190 10190300 ...
 $ V2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V3 : chr  "00/00" "00/00" "01/12" "01/12" ...
 $ V4 : chr  "12/11" "12/11" "00/00" "00/00" ...
 $ V5 : chr  "00/00" "00/00" "01/12" "01/12" ...
 $ V6 : chr  "12/11" "12/11" "00/00" "00/00" ...
 $ V7 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ V8 : int  150 150 150 150 150 150 150 150 150 150 ...
 $ V9 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V10: int  13 13 13 13 13 13 13 13 13 13 ...
 $ V11: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V12: int  200 200 200 200 200 200 200 200 200 200 ...
 $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V14: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V21: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V22: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V23: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V24: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V25: chr  "KG " "KG " "KG " "KG " ...
 $ V26: chr  "NO " "NO " "NO " "NO " ...
 $ V27: chr  "Pure-bred breeding horses                                                                                                      "| __truncated__ "Pure-bred breeding asses                                                                                                       "| __truncated__ "Pure-bred breeding horses                                                                                                      "| __truncated__ "Horses for slaughter                                                                                                           "| __truncated__ ...
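The effect of skip=1 and fill=TRUE can be seen on a miniature file with the same layout: a notation line first, data records, then a trailer with a different number of fields. The sample records are made up for illustration, and dropping the trailer row by its 999999999 key is an extra step assumed here rather than something done above.

```r
# Hypothetical miniature file: notation line, two data records, short trailer.
path <- tempfile()
writeLines(c("000000000|HMCUSTOMS CONTROL DATA|2012|12",
             "10110100|0|KG|Pure-bred breeding horses",
             "10110900|0|KG|Pure-bred breeding asses",
             "999999999|0000002"), path)

# skip = 1 drops the notation line; fill = TRUE pads the short trailer
# instead of stopping with a "did not have 4 elements" error.
dat <- read.table(path, sep = "|", quote = "", comment.char = "",
                  skip = 1, fill = TRUE, stringsAsFactors = FALSE)

# Assumed clean-up step: remove the trailer row by its sentinel key.
clean <- dat[dat$V1 != 999999999, ]
str(clean)
```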

As far as the encoding question goes, I can't offer any further insight:

Encoding (readLines("~/Downloads/SMKA12_2012archive/SMKA121212", n=1))
#[1] "unknown"
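On the narrower question of encoding versus fileEncoding, the help pages for scan and read.table do describe a difference. fileEncoding declares the encoding of the file so that its contents are converted as they are read, whereas encoding only marks the resulting strings as Latin-1 or UTF-8 and leaves the bytes untouched. A minimal sketch, assuming a one-line sample file and CP1252 as the file's encoding (both are assumptions):

```r
# Hypothetical one-line file: "grutten" wrapped in the bytes 0x93 and 0x94.
original <- c(as.raw(0x93), charToRaw("grutten"), as.raw(0x94))
path <- tempfile()
writeBin(c(original, charToRaw("\n")), path)

# encoding = only marks the strings; the stored bytes are unchanged.
marked <- scan(path, what = "", sep = "\n", encoding = "latin1", quiet = TRUE)
Encoding(marked)
charToRaw(marked)

# fileEncoding = converts from the declared encoding to the session's
# native encoding while reading, so the stored bytes may differ.
# In a locale that cannot represent these characters this may warn instead.
converted <- scan(path, what = "", sep = "\n", fileEncoding = "CP1252", quiet = TRUE)
lapply(converted, charToRaw)
```

This would also be consistent with the two calls in the question behaving differently: the fileEncoding="UTF-8" call attempts a conversion and warns about invalid input, while the encoding="UTF-8" call does not convert anything.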
