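I'm importing a comma-delimited CSV file into MATLAB. Every field is wrapped in double quotes, and I want the quotes to be treated as text qualifiers, with the comma as the delimiter.
I'm using the read_mixed_csv function from the answer to this question to read the data in as a cell array: Import CSV file with mixed data types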
thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file
thisdata = regexprep(thisdata, '^"|"$','');
However, some of my columns look like this:
"FAIRHOPE, Alabama"
"FAIRHOPE HIGH SCHOOL, FAIRHOPE, ALABAMA"
"Daphne-Fairhope-Foley, AL"
MATLAB puts everything after the comma into a new column. So
"Daphne-Fairhope-Foley, AL"
becomes two columns:
"Daphne-Fairhope-Foley
AL"
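How can I get MATLAB to read a mixed CSV file so that it respects the quotes and does not treat every comma as a delimiter? Is there a more automated way to do this, such as textscan? If textscan is an option, what would that look like?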
The data I'm trying to read also includes a header row.
*Note: Converting the CSV file to a tab-delimited file makes it easier for MATLAB to handle and works around this problem.
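To illustrate why the tab-delimited workaround in the note helps, here is a minimal sketch. The sample line is hypothetical and assumes the export keeps the surrounding quotes; because the commas are no longer delimiters, splitting on tabs leaves each quoted field intact.

```matlab
% Hypothetical tab-delimited line with two quoted fields (assumed sample data)
txt = sprintf('"Daphne-Fairhope-Foley, AL"\t"FAIRHOPE, Alabama"');

% Split on the tab character; commas inside the fields are left alone
parts = strsplit(txt, sprintf('\t'));

% Strip the leading and trailing quote from each field
parts = regexprep(parts, '^"|"$', '');
```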
Answer 0 (score: 1)
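Working with text qualifiers (such as ") is a bit tricky, but if you make sure every row of the table has the same number of columns (and probably no empty columns), the following may work.
Anything that is not inside the text qualifiers must be convertible to a number.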
function C = csvmixed(eachLine,delim,textQualifier)
% Outputs cell containing mixed string and numeric data given a delimiter (',')
% and a text qualifier ('"'). Each line of the delimited file must be loaded into
% the cell array eachLine, and each line must have the same number of columns.
%
% Example:
%   fid = fopen('testcsv.txt','r');
%   eachLine = textscan(fid,'%s','Delimiter','\n'); fclose(fid);
%   C = csvmixed(eachLine{1},',','"')

    assert(ischar(delim) && numel(delim)==1);
    assert(ischar(textQualifier) && numel(textQualifier)==1);

    % find strings, as specified by the input qualifier
    patternStr = sprintf('"([^"]*)"%c?',delim);
    patternStr = strrep(patternStr,'"',textQualifier);
    Cstr = regexp(eachLine,patternStr,'tokens');

    % find numeric data
    patternNum = sprintf('(?<=(,|^))[^%c,a-zA-Z]*(?=(,|$))',textQualifier);
    patternNum = strrep(patternNum,',',delim);
    Cnum = regexp(eachLine,patternNum,'match','emptymatch');

    numCols = cellfun(@numel,Cstr) + cellfun(@numel,Cnum);
    assert(nnz(diff(numCols))==0,'Number of columns not consistent.')

    % get string extents (begin, end) indexes for each string
    strExtents = regexp(eachLine,patternStr,'tokenExtents');

    % deal out parsed data for each line (C is preallocated here)
    C = cell(numel(eachLine),numCols(1));
    for ii = 1:numel(eachLine)
        strBounds = vertcat(strExtents{ii}{:});
        delimLocs = getDelimLocs(eachLine{ii},strBounds,delim);
        strCellMap = getCellMap(strBounds,delimLocs);
        C(ii,strCellMap) = [Cstr{ii}{:}]; % quoted fields are stored as strings
        C(ii,~strCellMap) = num2cell(str2double(Cnum{ii})); % all else must be numeric
    end
end

function delimLocs = getDelimLocs(lineText,solidBounds,delim)
    delimCharLocs = strfind(lineText,delim);
    delimLocs = delimCharLocs(~any(bsxfun(@ge,delimCharLocs,solidBounds(:,1)) & ...
                                   bsxfun(@le,delimCharLocs,solidBounds(:,2)),1));
end

function cellMap = getCellMap(typeBounds,delimLocs)
    cellMap = any(bsxfun(@gt,typeBounds(:,1),[0 delimLocs]) & ...
                  bsxfun(@lt,typeBounds(:,1),[delimLocs Inf]), 1);
end
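
To make the quoted-field pattern in csvmixed easier to follow, here is a minimal sketch of what it captures on a single line. The sample line is hypothetical, and the pattern is written out for the delimiter ',' and the qualifier '"'.

```matlab
% Hypothetical input line: two quoted fields around one numeric field
lineText = '"Daphne-Fairhope-Foley, AL",12,"FAIRHOPE, Alabama"';

% Same pattern csvmixed builds for delim = ',' and textQualifier = '"':
% a quoted run with no embedded quotes, optionally followed by the delimiter
patternStr = '"([^"]*)",?';

% Each match yields one token (the text between the quotes)
tokens = regexp(lineText, patternStr, 'tokens');
fields = [tokens{:}];   % flatten the nested 1x1 cells into a 1xN cell array
```

The commas inside the quotes stay part of the captured text, which is why csvmixed only treats commas outside these extents as delimiters.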
Update: Fixed a small typo in getDelimLocs. Added preallocation of the cell array.
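
As a possible alternative to the regex approach, and not part of the answer itself, textscan has a %q conversion specifier that reads double-quoted text and drops the surrounding quotes. The sketch below assumes a hypothetical two-column layout (one quoted text column, one numeric column); the format string would need to match the real file.

```matlab
% Hypothetical two-row sample: a quoted text column and a numeric column
txt = sprintf('"Daphne-Fairhope-Foley, AL",12\n"FAIRHOPE, Alabama",34');

% %q reads a double-quoted field as text; %f reads a floating-point number
C = textscan(txt, '%q %f', 'Delimiter', ',');

names  = C{1};   % cell array of character vectors, quotes removed
values = C{2};   % numeric column vector
```

To read from a file, the first argument would be a file identifier from fopen in place of the character vector.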
Answer 1 (score: 0)
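Use the File Exchange code replaceinfile to replace the commas inside the strings with periods.
Use read_mixed_csv from Import CSV file with mixed data types to read the file.
Remove the extra quotes from the remaining strings.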
replaceinfile(', ', '. ', fname); % Replace ', ' with '. ' so commas inside quoted fields do not start a new column
thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file (\t for tab)
thisdata = regexprep(thisdata, '^"|"$',''); % Remove the leading and trailing quote from each cell
For the replaceinfile.m function: to run the code on Linux, change the first line of the Perl section to
perlCmd = sprintf('"%s"', '/usr/bin/perl');
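
The effect of the replacement step in this answer can be previewed on a single line in memory. This is a minimal sketch using strrep and strsplit in place of replaceinfile and read_mixed_csv, on a hypothetical sample line. It relies on the same assumption as the answer: a comma followed by a space only ever occurs inside a quoted field, never as a delimiter.

```matlab
% Hypothetical CSV line; delimiters are bare commas, embedded commas are ', '
lineText = '"Daphne-Fairhope-Foley, AL","FAIRHOPE, Alabama",12';

% Turn embedded ', ' into '. ' so only true delimiters remain as commas
cleaned = strrep(lineText, ', ', '. ');

% Split on the remaining commas and strip the surrounding quotes
fields = regexprep(strsplit(cleaned, ','), '^"|"$', '');
```

Note that the original commas are not restored, so the text reads 'Daphne-Fairhope-Foley. AL' after import.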