I want to import data into Matlab from a text file with roughly 800k rows that looks like this:
"209","1000",".10500","N/A","36","116","2006-03-16 00:00:00","2519","431.400000","-6.760000","568.600000","142.620000",".000000",".000000",".000000",".000000","2","CHARGEOFF","","","2008-02-16 00:00:00","33.100000"
"190","1000",".18750","N/A","36","116","2006-03-14 00:00:00","0",".000000","-5.230000","1000.000000","269.370000","20.000000","60.000000","4.910000",".000000","4","COMPLETED","","","2009-03-14 00:00:00",".000000"
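However, for some entries (not shown above), a comma is part of the string inside the quotes, for example "N,A".
To simplify things, I stripped all the quotation marks from the file, but then I found that some rows no longer had the same number of commas, which made importing the data into Matlab even harder.
readtable can import the file, but it takes too long, and it stores the values as text: for example, 209 is imported as the string '209' rather than as a number.
Thanks!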
Answer 0 (score: 0)
First, I saved the following string in the file yourFile.txt. Note the comma between N and A.
"209","1000",".10500","N,A","36","116","2006-03-16 00:00:00","2519","431.400000","-6.760000","568.600000","142.620000",".000000",".000000",".000000",".000000","2","CHARGEOFF","","","2008-02-16 00:00:00","33.100000"
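I first read the text file with readtext (as far as I know, a MATLAB File Exchange submission rather than a built-in function), as follows: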
fileContents=readtext('yourFile.txt',',"'); % ," is the delimiter.
% If you want to keep the entries between the quotes as characters.
processedContentChar=cellfun(@(x) regexprep(x,'"',''),fileContents,'uni',0);
% If you want numeric entries, however 'N,A' will be converted to NaN.
processedContentNum=cellfun(@(x) str2double(regexprep(x,'"','')),fileContents,'uni',0);
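Answer 1 (score: 0)
What I did was prepare the data with sed: find any | characters and delete them, replace each "," with |, and then find the remaining " characters and delete them. This is basically the idea from Parag's answer above.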
time sed 's/|//g' file.csv | sed 's/","/|/g' | sed 's/"//g' > file_bar.csv
This took 3.5 minutes for the ~800k-row, 540-column file.
Then in Matlab I used readtable with | specified as the delimiter, and that took 10 minutes.
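The three sed steps above can also be chained in a single sed invocation, which avoids starting two extra processes. The following is a minimal sketch run on a made-up sample row (the variable names sample and converted are hypothetical, and the row is shortened from the question's data); I have not timed it against the full 800k-row file, so any speed difference is an assumption.

```shell
# Hypothetical sample row: a comma inside a quoted field and one empty field.
sample='"209","N,A","","36"'

# The same three steps as the pipeline above, in one sed process:
#   1. delete any existing | so it is free to serve as the new delimiter
#   2. replace each "," field boundary with |
#   3. delete the remaining quotes (start and end of the line)
converted=$(printf '%s\n' "$sample" | sed -e 's/|//g' -e 's/","/|/g' -e 's/"//g')

printf '%s\n' "$converted"
# prints: 209|N,A||36
```

The comma inside N,A survives because only the three-character sequence "," is treated as a field boundary, and the empty field is preserved as two adjacent | characters.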