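Using fread to read data with double quotes and incorrect escape characters

Time: 2016-02-25 12:05:15

Tags: r data.table

I am trying to load a large data file (about 20 million rows) with fread() from the data.table package. However, some rows cause serious trouble.

Minimal example: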

text.csv contains:

id, text
1,"""Oops"",\""The"",""Georgia"""        

fread("text.csv", sep=",")

Error in fread("text.csv", sep = ",") : 
  Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
  Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text

read.table() works slightly better, but it is too slow and too memory-inefficient.

> read.table("text.csv", header = TRUE, sep=",")
  id                     text
1  1 "Oops",\\"The","Georgia"

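I realize that my text file is not formatted correctly, but it is far too large to edit.

Any help is much appreciated.

Edit

A small sample of the actual data records: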

sample1.txt, a good record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music

> fread("sample1.txt", sep=",")
               materiale_id      dk5                      description         creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA
                                           title  type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music


sample2.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article

> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...

Edit 2:

Updating to R version 3.2.3 and data.table 1.9.6 helped with the problem above, but causes problems with other records:

sample3.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

Edit 3:

Updating to the development version of data.table, 1.9.7, breaks fread() completely:

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  showProgress is not type integer but type 'logical'. Please report.

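Edit 4:

The problems in my file occur when a record contains the string \\" (literally, not as a regular expression). Apparently there is one backslash too many, which causes fread() to misinterpret the double quote as the end of the string, when it should have been treated as escaped.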
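To see which records are affected before rewriting anything, a rough sketch like the following could be used. It flags every line that contains a backslash immediately followed by a double quote. The vector raw_lines is a hypothetical stand-in built from the minimal example; on the real file it would come from readLines().

```r
# Requires R >= 4.0.0 for raw string literals, r"(...)".
# Stand-in for readLines("text.csv"): the two lines of the minimal example.
raw_lines <- c(
  "id, text",
  r"(1,"""Oops"",\""The"",""Georgia""")"
)

# fixed = TRUE searches for the literal two-character sequence
# backslash + double quote, with no regular-expression interpretation.
has_bs_quote <- grepl('\\"', raw_lines, fixed = TRUE)

# Line numbers of the suspicious records
bad_line_numbers <- which(has_bs_quote)
print(bad_line_numbers)
```

On a 20-million-row file this still has to hold all lines in memory, so it may be worth trying it on a chunk of the file first.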

My best solution so far is:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)    
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

This seems to work.

I don't understand this 100%, so any clarification is very welcome.

1 answer:

Answer 0: (score: 2)

Not a data.table solution, but you could try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
  id                             text
1  1 """Oops"",\\""The"",""Georgia"""
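If tidyr is not available, the same idea (split each line at the first comma only and keep the rest as one field) can be sketched in base R. This is only an illustration for the two-column minimal example; the vector raw_lines is a hypothetical stand-in for the result of readLines().

```r
# Requires R >= 4.0.0 for raw string literals, r"(...)".
# Stand-in for readLines("text.csv")
raw_lines <- c(
  "id, text",
  r"(1,"""Oops"",\""The"",""Georgia""")"
)

# Column names from the header line
nms <- trimws(strsplit(raw_lines[1], ",", fixed = TRUE)[[1]])

# Position of the first comma in every data line
body <- raw_lines[-1]
first_comma <- as.integer(regexpr(",", body, fixed = TRUE))

# Everything before the first comma is column 1, everything after is column 2
df2 <- data.frame(
  substr(body, 1, first_comma - 1),
  substr(body, first_comma + 1, nchar(body)),
  stringsAsFactors = FALSE
)
names(df2) <- nms
print(df2)
```

As with the separate() call, the text column is kept verbatim, including the surrounding quotes and the stray backslash.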

Update for edit 1: with the new sample data, fread seems to read the data correctly:

> s1 <- fread("sample1.txt", sep=",")
> s1
               materiale_id      dk5                      description         creator subject-phrase                                         title  type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
               materiale_id      dk5
1: 125030-katalog:000000003 [78.793]
2:   150012-leksikon:100019         
                                                                                                                                                                                                           description
1:                                                                                                                                                                                    Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
           creator                                                                                                                         subject-phrase
1: Frederik Lundin                                                                                                                                       
2:                 [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
                                           title    type
1: Koncert i Copenhagen Jazz House den 26.1.1995   music
2:               it - elementer i databehandling article

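Update for edits 2 & 3:

When you look at the error message:

Error in fread("sample3.txt", sep = ",") : 
  Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.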

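and then look at the second line of sample3.txt, you can see that the fifth column (subject-phrase) also contains commas. You can work around this in three steps:

1: Read the file with readLines and replace the opening and closing characters of the fifth column with a different quote character: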

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text file:

 writeLines(r3, "sample3-1.txt")

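3: Now you can read it with fread (or read.table / read.csv). Because the number of column headers differs from the number of columns, you have to use header = FALSE. Also set the quote character explicitly to the new quote character that was inserted in step 1: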

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
                         V1 V2 V3                   V4                                                        V5           V6                               V7    V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf                         ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom                       1929-1930" music
2: 125030-katalog:000236596 NA NA  Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

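After that, you can assign column names as follows (the vector shown is a placeholder; it has to contain eight names, one per column):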

names(s3) <- c("character","vector","with","eight","column","names")
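The three steps can also be sketched end to end in a self-contained way. The version below is an illustration under a few assumptions: it recreates sample3.txt in a temporary file, and it uses base R's read.csv in place of fread so that it runs without data.table. The object names (sample3, infile, outfile) are made up for the example.

```r
# Requires R >= 4.0.0 for raw string literals, r"---(...)---".
sample3 <- c(
  "materiale_id,dk5,description,creator,subject-phrase,title,type",
  r"---(125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music)---",
  r"---(125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music)---"
)

# Temporary files stand in for sample3.txt and sample3-1.txt
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(sample3, infile)

# Step 1: swap the "[ ... ]" delimiters of the fifth column for single quotes
r3 <- readLines(infile)
r3 <- gsub('"[', "'", r3, fixed = TRUE)
r3 <- gsub(']"', "'", r3, fixed = TRUE)

# Step 2: write the modified lines back to disk
writeLines(r3, outfile)

# Step 3: read with the single quote as the only quote character;
# the header line is skipped because it has 7 names for 8 columns
s3 <- read.csv(outfile, header = FALSE, skip = 1, quote = "'",
               stringsAsFactors = FALSE)
print(dim(s3))
```

Note that using ' as the quote character means any apostrophe elsewhere in the real data would be treated as a quote, so this is worth checking on a larger sample before running it on the full file.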

Note: I used a recent build of v1.9.7 (from two weeks ago).
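Regarding Edit 4 of the question, the effect of the gsub() call can be checked on a single string. The sketch below applies it to the data line of the minimal example and compares it with a fixed = TRUE spelling, which is an alternative written for this illustration and not taken from the question.

```r
# Requires R >= 4.0.0 for raw string literals, r"(...)".
bad <- r"(1,"""Oops"",\""The"",""Georgia""")"

# Call from the question. As a regular expression, the pattern matches a
# literal backslash followed by a double quote. In the replacement, the
# backslash only escapes the quote, so a single double quote is inserted.
fixed_regex <- gsub("\\\\\"", "\\\"", bad)

# Alternative spelling with literal matching (no regex, no replacement escapes)
fixed_literal <- gsub('\\"', '"', bad, fixed = TRUE)

# Both calls drop the stray backslash and leave the doubled quote in place
print(identical(fixed_regex, fixed_literal))
cat(fixed_regex, "\n")
```

In other words, the call removes the backslash in front of the quote, so the remaining doubled quote is read as an ordinary escaped quote inside the field.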