Question

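I am trying to import a text file into R (3.4.0). It actually contains 4 columns, but the 4th column is mostly empty until somewhere past row 200,000. I used fread() from the data.table package (ver 1.10.4):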

fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)

I got this error message:

Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) : 
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep='  ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

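I checked the file, and there is indeed additional text ("8-4") in the 4th column of line 258088.

Nevertheless, fill = TRUE did not solve this as I expected. I thought fread() might be determining the number of columns incorrectly because the additional column appears so late in the file. So I tried this: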

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)

The error persisted. On the other hand,

fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)

gives no error.

I thought I had found the reason, but something strange happened when I tested with a generated dummy file:

write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)

and then added "8-4" in the 4th column of row 250000 using Excel. When this file is read by fread():

fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")

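there is no error message and it works fine, which suggests that an additional column appearing late in the file does not necessarily trigger the error.

I also tried changing the encoding ("Latin-1" and "UTF-8") and the quote setting, but neither helped.

Now I feel clueless. Hopefully I have done my homework and provided reproducible information. Thanks for your help.

For additional environment information, my sessionInfo() is: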

R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] dplyr_0.5.0            purrr_0.2.2.2          readr_1.1.1            tidyr_0.6.3           
  [5] tibble_1.3.3           ggplot2_2.2.1          tidyverse_1.1.1        stringr_1.2.0         
  [9] microbenchmark_1.4-2.1 data.table_1.10.4     

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11     cellranger_1.1.0 compiler_3.4.0   plyr_1.8.4       forcats_0.2.0   
[6] tools_3.4.0      jsonlite_1.5     lubridate_1.6.0  nlme_3.1-131     gtable_0.2.0    
[11] lattice_0.20-35  rlang_0.1.1      psych_1.7.5      DBI_0.6-1        parallel_3.4.0  
[16] haven_1.0.0      xml2_1.1.1       httr_1.2.1       hms_0.3          grid_3.4.0      
[21] R6_2.2.1         readxl_1.0.0     foreign_0.8-68   reshape2_1.4.2   modelr_0.1.0    
[26] magrittr_1.5     scales_0.4.1     rvest_0.3.2      assertthat_0.2.0 mnormt_1.5-5    
[31] colorspace_1.3-2 stringi_1.1.5    lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2

Answer 1

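There is actually a difference between the two files you describe, and I think that is why fread gives different output for them.

In the first file, every line ends right after the 3rd column, except line 258088, which has a tab, the fourth column, and then the end of line. (You can confirm this with an editor's "show all characters" option.)

The second file, on the other hand, has an extra tab on every line, i.e. a new empty column. So in the first case fread expects 3 columns and then finds a 4th. In the second file, by contrast, fread expects 4 columns from the start.

I checked read.table with fill=TRUE, and it worked with both files. So I think the fill option of fread does something different.

I would have expected that, with fill=TRUE, all lines are used to infer the number of columns (at some cost in computation time).

There are some nice workarounds in the comments that you can use.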
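One way to check for this kind of difference without opening the file in an editor is to count the fields on each line with base R's count.fields. The sketch below is only an illustration: the two temporary files are hypothetical miniatures of the files described above, not the asker's actual data.

```r
# Hypothetical miniature of the first file: three fields per line,
# except one late line that carries a fourth field.
short_file <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6", "7\t8\t9\t8-4"), short_file)

# Hypothetical miniature of the second file: every line ends with a tab.
padded_file <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3\t", "4\t5\t6\t", "7\t8\t9\t8-4"), padded_file)

# Count tab-separated fields per line, with quoting and comments disabled.
short_counts  <- count.fields(short_file,  sep = "\t", quote = "", comment.char = "")
padded_counts <- count.fields(padded_file, sep = "\t", quote = "", comment.char = "")

# Tabulate how many lines have each field count.
print(table(short_counts))
print(table(padded_counts))

# Line numbers whose field count differs from that of the first line.
odd_lines <- which(short_counts != short_counts[1])
print(odd_lines)
```

On the real file, the same table() call on count.fields("test.txt", sep = "\t", quote = "", comment.char = "") should show whether almost all lines have 3 fields and only a few have 4.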

Answer 2

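The file has a problem: if the table has four columns, then a \t should be present at the end of every row where the fourth column is missing.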

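In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row that lacks the fourth column, split each line on \t, and collect everything in a data.frame. Most of this work is done by the data.table::tstrsplit function.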
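The answer's original snippet is not preserved here, so the following is only a minimal sketch of the approach it describes, not the author's code. It assumes the data.table package is installed and uses a small hypothetical file in place of test.txt. Because tstrsplit pads short rows itself through its fill argument, the explicit step of appending \t is not needed in this version.

```r
library(data.table)

# Hypothetical miniature of the problem file: only the last line has a 4th field.
path <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6", "7\t8\t9\t8-4"), path)

# Read the raw lines without any parsing.
lines <- readLines(path)

# Split each line on tabs and transpose the pieces into columns.
# Rows with fewer fields are padded with NA (fill = NA).
cols <- tstrsplit(lines, "\t", fixed = TRUE, fill = NA, type.convert = TRUE)

# Collect the columns into a data.table; unnamed columns become V1, V2, ...
dt <- as.data.table(cols)
print(dt)
```

For the real file, replace path with "test.txt". Reading several hundred thousand lines with readLines is slower than fread but avoids the column-count guess entirely.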

Answer 3

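I was struggling with this too. I found another solution (for csv and read.table) here: How can you read a CSV file in R with different number of columns. That answer uses the handy function count.fields to count the fields in the file line by line, and then takes the maximum field count so that the largest number of column names can be passed to fread. A reproducible example follows.

Generate text with an uneven number of fields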

text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"

Write it to a file

cat(text, file = "foo")

Scan the file for delimiters

max.fields<-max(count.fields("foo", sep = ','))

Now use fread to read the file, but tell it to expect the maximum number of columns through the col.names argument

fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))

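However, I based this data on the example data from ?count.fields, and found that if the maximum number of fields is in the last line of the file, fread still fails with the following error.

Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", : Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.

For example: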

text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))

I will report this issue on the data.table GitHub. Update: the issue is logged here: https://github.com/Rdatatable/data.table/issues/2691
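As a fallback that does not depend on fread, the same count.fields trick can be combined with base R's read.table, which is what the linked answer targets. This is a sketch under the assumption that the fields contain no quote or comment characters, as in the example data; the temporary file path is only for illustration.

```r
# Same layout as the failing example: the widest line is the last one.
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
path <- tempfile()
cat(text, file = path)

# Maximum number of comma-separated fields over all lines.
max.fields <- max(count.fields(path, sep = ","))

# Supplying max.fields column names makes read.table allocate that many
# columns; fill = TRUE pads the shorter rows.
df <- read.table(path, header = FALSE, sep = ",", fill = TRUE,
                 strip.white = TRUE, quote = "", stringsAsFactors = FALSE,
                 col.names = paste0("V", seq_len(max.fields)))
print(dim(df))
```

If a data.table is needed afterwards, data.table::as.data.table(df) converts the result.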

r - Error: text after processing all cols in fread (data.table)
