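I have a large (8GB+) CSV file (comma-separated) that I want to read into R. The file contains three columns:
date #in 2017-12-27 format
text #a string type
type #a label for each string (NA, typeA, or typeB)
The problem I am running into is that the text column contains various string indicators: ' (single quotes), " (double quotes), no quote marks at all, as well as multiple separated strings.
E.g.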
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
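To read this large data set, I have tried:
read.csv and readr's read_csv functions: these work partially, but either fail (probably due to memory) or take a very long time to read
fread (which I hoped would solve the other two problems): Error: Expecting 3 cols, but line 1103 contains text after processing all cols.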
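My idea is to work around these issues by using the specifics I know about the data, i.e., that each line starts with a date and ends with NA, typeA, or typeB.
How could I implement this (using pure readLines or fread)?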
Edit: Sample data opened with Mac's TextWrangler (anonymized):
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA
Sample data 2:
"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA
Reproducible example data for the fread error:
"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file \""Macintosh HD:abc:def:filename\"", \""/abc.txt\""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA
fread error:
"Expecting 3 cols, but line 3 contains text after processing all cols."
SessionInfo:
Answer 0 (score: 1)
A readLines approach could be:
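infile <- file("test.txt", "r")
# Read (and discard) the header line
txt <- readLines(infile, n = 1)
df <- NULL
# Number of lines read per iteration; change this value as per your requirement
chunksize <- 1
while (length(txt)) {
  txt <- readLines(infile, warn = FALSE, n = chunksize)
  # date = first token, type = last token, text = everything in between
  df <- rbind(df, data.frame(date = gsub("\\s.*", "", txt),
                             text = trimws(gsub("\\S+(.*)\\s+\\S+$", "\\1", txt)),
                             type = gsub(".*\\s", "", txt),
                             stringsAsFactors = FALSE))
}
# Release the file connection once all lines are consumed
close(infile)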
which gives:
> df
date text type
1 2016-01-01 great job! NA
2 2016-01-02 please, type "submit" typeA
3 2016-01-02 "can't see the "error" now" typeA
4 2016-01-03 "add \\\\"/filename.txt\\\\"" NA
Sample data: test.txt contains
date text type
2016-01-01 great job! NA
2016-01-02 please, type "submit" typeA
2016-01-02 "can't see the "error" now" typeA
2016-01-03 "add \\"/filename.txt\\"" NA
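Since the real file is 8GB+, growing df with rbind on every iteration can get slow. Below is a minimal sketch of the same idea, assuming the whitespace-separated layout of test.txt shown above: read larger chunks, keep each parsed chunk in a list, and bind everything once at the end. The function name read_by_chunks and the default chunk size of 10000 lines are hypothetical choices for illustration; they are not part of the original code.

```r
# Hypothetical helper (not from the original answer): parse a
# whitespace-separated file chunk by chunk and bind the pieces once.
read_by_chunks <- function(path, chunksize = 10000L) {
  con <- file(path, "r")
  on.exit(close(con))            # always release the connection
  readLines(con, n = 1L)         # read and discard the header line
  pieces <- list()
  repeat {
    txt <- readLines(con, n = chunksize, warn = FALSE)
    if (length(txt) == 0L) break # end of file reached
    pieces[[length(pieces) + 1L]] <- data.frame(
      date = sub("\\s.*", "", txt),                          # first token
      text = trimws(sub("^\\S+(.*)\\s+\\S+$", "\\1", txt)),  # the middle part
      type = sub(".*\\s", "", txt),                          # last token
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, pieces)         # NULL if the file has only a header
}

# Usage (assumes test.txt exists in the working directory):
# df <- read_by_chunks("test.txt")
```

As in the loop above, type is kept as a plain string here, so missing labels come back as the text "NA" rather than a real missing value.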
Update
You can modify the code above with the following regex parser to parse the other set of sample data:
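# date: the 10 characters after the opening quote; text: the quoted middle field;
# type: whatever follows the last comma, with any quotes stripped
df <- rbind(df, data.frame(date = gsub("\"(\\S{10}).*", "\\1", txt),
                           text = gsub(".*\"\\,\"(.*)\"\\,(\"|NA).*", "\\1", txt),
                           type = gsub(".*\\,|\"", "", txt),
                           stringsAsFactors = FALSE))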
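The same idea can also be written as one anchored pattern that relies on what is known about the data: each line starts with a quoted date and ends with NA, typeA, or typeB (quoted or not). The sketch below is only an illustration under those assumptions. parse_lines is a hypothetical helper name, and collapsing doubled quotes ("") into a single quote is an assumed CSV convention, not something confirmed for this file.

```r
# Hypothetical helper (not from the original answer): split quoted CSV
# lines into date / text / type using one anchored regular expression.
parse_lines <- function(txt) {
  pattern <- '^"([0-9]{4}-[0-9]{2}-[0-9]{2})","(.*)","?(NA|typeA|typeB)"?$'
  m <- regmatches(txt, regexec(pattern, txt))
  ok <- lengths(m) == 4L           # full match plus three capture groups
  if (!any(ok)) {
    return(data.frame(date = as.Date(character(0)), text = character(0),
                      type = character(0), stringsAsFactors = FALSE))
  }
  parts <- do.call(rbind, m[ok])   # character matrix, one row per matched line
  data.frame(
    date = as.Date(parts[, 2]),
    # Assumption: a doubled quote inside the text stands for one literal quote
    text = gsub('""', '"', parts[, 3], fixed = TRUE),
    type = ifelse(parts[, 4] == "NA", NA_character_, parts[, 4]),
    stringsAsFactors = FALSE
  )
}

# Usage inside the chunked loop (txt as returned by readLines):
# df <- rbind(df, parse_lines(txt))
```

Lines that do not match the pattern, including the header, are silently dropped, so it may be worth inspecting them separately before relying on the result.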
The other set of sample data:
"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid","typeA"