Question

I am trying to read a table with fread. The text file looks like this:

"No","Comment","Type"
"0","he said:"wonderful|"","A"
"1","Pr/ "d/s". "a", n) ","B"

The R code I am using is: dataset0 <- fread("data/test.txt", stringsAsFactors = F), with the development version of the data.table R package.

I expected a dataset with three columns, but got this instead:

Error in fread(input = "data/stackoverflow.txt", stringsAsFactors = FALSE) : 
Line 3 starting <<"1","Pr/ ">> has more than the expected 3 fields.
Separator 3 occurs at position 26 which is character 6 of the last field: << n) ","B">>. 
Consider setting 'comment.char=' if there is a trailing comment to be ignored.

How can I fix this?

Answer 1

The development version of data.table handles files like this, in which the embedded quotes have not been escaped. See point 10 on the wiki page.

I just tested it on your input and it works.

$ more unescaped.txt
"No","Comment","Type"
"0","he said:"wonderful."","A"
"1","The problem is: reading table, and also "a problem, yes." keep going on.","A"

> DT = fread("unescaped.txt")
> DT
   No                                                                  Comment Type
1:  0                                                     he said:"wonderful."    A
2:  1 The problem is: reading table, and also "a problem, yes." keep going on.    A
> ncol(DT)
[1] 3
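
For context on why a parser that treats every comma as a separator stumbles here, the following base R sketch counts fields by naively splitting the question's three lines on commas. It does not use data.table; the variable names are illustrative only.

```r
# Lines copied from the question's file
raw <- c(
  '"No","Comment","Type"',
  '"0","he said:"wonderful|"","A"',
  '"1","Pr/ "d/s". "a", n) ","B"'
)

# Naive field count: split on every comma, ignoring quotes entirely
n_fields <- lengths(strsplit(raw, ",", fixed = TRUE))
print(n_fields)

# Character positions of the commas in the third line
comma_pos <- as.integer(gregexpr(",", raw[3], fixed = TRUE)[[1]])
print(comma_pos)
```

The third line yields four fields instead of three, and its third comma sits at position 26, which matches the position reported in the error message in the question.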

Answer 2

Read the file line by line with readLines, then replace the separator and parse with read.table:

# read with no sep
x <- readLines("test.txt")

# introduce new sep - "|"
x <- gsub("\",\"", "\"|\"", x)

# read with new sep
read.table(text = x, sep = "|", header = TRUE)

#   No                                                                  Comment Type
# 1  0                                                     he said:"wonderful."    A
# 2  1 The problem is: reading table, and also "a problem, yes." keep going on.    A
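
One thing to watch with the approach above: the replacement separator must not occur in the data, and the sample in the question contains a literal "|" inside a field, so that choice may collide. As a variation (a different technique, not what the code above does), the lines can be split directly on the quote-comma-quote boundary without choosing a new separator. The sketch below is base R only. The helper name parse_quoted_lines is hypothetical, and it assumes every field is quoted, no field contains the exact sequence "," (quote, comma, quote), and the last field of each line is non-empty.

```r
# Sample lines copied from the question (embedded quotes are not escaped)
raw <- c(
  '"No","Comment","Type"',
  '"0","he said:"wonderful|"","A"',
  '"1","Pr/ "d/s". "a", n) ","B"'
)

# Hypothetical helper, not part of data.table or base R
parse_quoted_lines <- function(lines) {
  # Strip the opening quote of the first field and the closing quote of the last
  inner <- sub('^"', "", sub('"$', "", lines))
  # Split only where one quoted field ends and the next begins
  fields <- strsplit(inner, '","', fixed = TRUE)
  # Every line must yield the same number of fields as the header
  stopifnot(all(lengths(fields) == length(fields[[1]])))
  # Stack the data rows into a character matrix, then name the columns
  df <- as.data.frame(do.call(rbind, fields[-1]), stringsAsFactors = FALSE)
  names(df) <- fields[[1]]
  df
}

df <- parse_quoted_lines(raw)
print(df)
```

All columns come back as character, so convert them afterwards if needed (for example with as.integer on the No column). For a file on disk, pass readLines("test.txt") instead of the inline sample.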

fread - multiple separators within a string
