Question

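I have a large (8GB+) csv file (comma-separated) that I want to read into R. The file contains three columns:

date #in 2017-12-27 format
text #a string
type #a label for each string (NA, typeA or typeB)

The problem I run into is that the text column contains various string indicators: ' (single quotation marks), " (double quotation marks), no quotation marks at all, as well as multiple separated strings.

E.g.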

date        text                        type
2016-01-01  great job!                  NA
2016-01-02  please, type "submit"       typeA
2016-01-02  "can't see the "error" now" typeA
2016-01-03  "add \\"/filename.txt\\""   NA

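To read this large data set, I have tried:

Base read.csv and readr's read_csv functions: they work for part of the file but then fail (possibly due to memory) or take a very long time to read
Chunking the data into batches of 1m lines via the Mac terminal: fails because the lines seem to break arbitrarily
Using fread (which I hoped would solve the other two problems): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.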

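My idea is to work around these issues by using the specifics I know about the data, i.e. that each line starts with a date and ends with NA, typeA or typeB.

How would I implement this (either using pure readLines or with fread)?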

Edit: Sample data (anonymized), as opened with TextWrangler on Mac:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA

Sample data 2:

"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA

Sample data that reproduces the fread error:

"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA

fread error:

"Expecting 3 cols, but line 3 contains text after processing all cols."

Answer 1

A readLines approach could be:

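infile <- file("test.txt", "r")
txt <- readLines(infile, n = 1)  # read (and discard) the header line
df <- NULL

# lines read per iteration; change this value as per your requirement
chunksize <- 1

while (length(txt)) {
  txt <- readLines(infile, warn = FALSE, n = chunksize)
  # date = first token, type = last token, text = everything in between
  df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
                             text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
                             type = gsub(".*\\s", "", txt),
                             stringsAsFactors = FALSE))
}
close(infile)  # release the file connection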

which gives

> df
        date                          text  type
1 2016-01-01                    great job!    NA
2 2016-01-02         please, type "submit" typeA
3 2016-01-02   "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\""    NA

Sample data: test.txt contains

date        text                        type
2016-01-01  great job!                  NA
2016-01-02  please, type "submit"       typeA
2016-01-02  "can't see the "error" now" typeA
2016-01-03  "add \\"/filename.txt\\""   NA
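
Calling rbind on every iteration copies the rows accumulated so far, which may become slow on a file of this size. Below is a minimal sketch of the same parsing idea that collects each chunk in a list and binds once at the end. The names parse_chunk and read_in_chunks and the default chunk size are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper: split "<date> <free text> <type>" lines into three fields.
parse_chunk <- function(txt) {
  data.frame(
    date = sub("\\s.*", "", txt),                          # first token
    text = trimws(sub("^\\S+(.*)\\s+\\S+$", "\\1", txt)),  # everything in between
    type = sub(".*\\s", "", txt),                          # last token
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: read the file chunk by chunk and bind the pieces once.
read_in_chunks <- function(path, chunksize = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))          # close the connection even if parsing fails
  readLines(con, n = 1L)       # skip the header line
  chunks <- list()
  repeat {
    txt <- readLines(con, n = chunksize, warn = FALSE)
    if (length(txt) == 0L) break
    chunks[[length(chunks) + 1L]] <- parse_chunk(txt)
  }
  do.call(rbind, chunks)
}

# Small demonstration on a temporary copy of some of the sample lines above.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "date        text                        type",
  "2016-01-01  great job!                  NA",
  "2016-01-02  please, type \"submit\"       typeA",
  "2016-01-02  \"can't see the \"error\" now\" typeA"
), tmp)
df <- read_in_chunks(tmp, chunksize = 2L)
```

Note that type comes back as the string "NA" rather than a missing value, exactly as in the loop above; converting it is left to the caller.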

Update: You can modify the code above with the following regex-based parser to parse the other set of sample data:

df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
                           text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
                           type = gsub(".*\\,|\"", "", txt),
                           stringsAsFactors = FALSE))

Another set of sample data:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid","typeA"

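The question notes that every record starts with a date and ends with NA, typeA or typeB, and that lines sometimes seem to break arbitrarily. A rough sketch of how those two facts could be combined for the quoted format is shown here. It assumes a record always begins with a quoted YYYY-MM-DD date followed by a comma, that the type may or may not be quoted, and that broken pieces can be glued back with a single space. The helper names join_records and parse_records are hypothetical, and the line break inside the second record of the demo input was introduced by hand for illustration.

```r
# Assumption: a new record always starts with a quoted ISO date and a comma.
record_start <- "^\"[0-9]{4}-[0-9]{2}-[0-9]{2}\","
# Assumption: a full record is "date","text",type with type optionally quoted.
record_shape <- "^\"([0-9]{4}-[0-9]{2}-[0-9]{2})\",\"(.*)\",\"?(NA|typeA|typeB)\"?$"

# Hypothetical helper: glue physical lines back into logical records.
join_records <- function(lines) {
  id   <- cumsum(grepl(record_start, lines))  # record number for each line
  keep <- id > 0                              # drops the header line
  unname(vapply(split(lines[keep], id[keep]), paste, character(1), collapse = " "))
}

# Hypothetical helper: extract the three fields, keeping non-matching records aside.
parse_records <- function(records) {
  m  <- regmatches(records, regexec(record_shape, records))
  ok <- lengths(m) == 4L                      # full match plus three groups
  fields <- do.call(rbind, m[ok])
  if (is.null(fields)) fields <- matrix(character(0), ncol = 4L)
  list(
    data = data.frame(
      date = fields[, 2],
      text = gsub("\"\"", "\"", fields[, 3], fixed = TRUE),  # undo doubled quotes
      type = fields[, 4],
      stringsAsFactors = FALSE
    ),
    rejected = records[!ok]                   # records that need manual inspection
  )
}

raw <- c(
  "\"date\",\"text\",\"type\"",
  "\"2016-04-01\",\"Fiex this now. Please check.\",\"typeA\"",
  "\"2014-05-02\",\"Tried to remove \"\"../\"\"",   # record broken across two lines
  "but no success @myid\",typeA",
  "\"2015-03-02\",\"Some text, some text, some question? Please, some question?\",NA"
)
res <- parse_records(join_records(raw))
```

For a file that does not fit in memory, the same two helpers could be applied per chunk read with readLines, carrying the last (possibly incomplete) record over to the next chunk; that part is not shown here.
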
Reading large data with messy strings and multiple string indicators in R
