Question

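I am importing a comma-delimited CSV file into MATLAB. Each field is wrapped in double quotes around the text I want to keep, followed by a comma.

I am using the read_mixed_csv function from the answer to this question to read the data in as a cell array: Import CSV file with mixed data types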

thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file 
thisdata = regexprep(thisdata, '^"|"$','');

However, some of my columns look like this:

"FAIRHOPE, Alabama"
"FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA"
"Daphne-Fairhope-Foley, AL"

MATLAB puts everything after the comma into a new column. So

"Daphne-Fairhope-Foley, AL"

becomes two columns

"Daphne-Fairhope-Foley
AL"

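How can I get MATLAB to read a mixed CSV file so that it respects the quotes as well as the comma delimiter? Is there a more automated way to do this? If textscan is an option, what would that look like?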

Here is a sample of the data I am trying to read, including the header row:

[data sample not preserved in the source]

*Note: Converting the CSV file to a tab-delimited file makes it easier for MATLAB to handle and works around this problem.

Answer 1

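Working with a text qualifier (such as ") is a bit tricky, but the following should work as long as every row of the table has the same number of columns (and possibly no empty columns).

Anything that is not enclosed in the text qualifier must be convertible to a number.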

function C = csvmixed(eachLine,delim,textQualifier)
% Outputs cell containing mixed string and numeric data given a delimiter (',') 
% and a text qualifier ('"').  Each line of the delimited file must be loaded into 
% the cell array eachLine, and each line must have the same number of columns.
% 
% Example:
%   fid = fopen('testcsv.txt','r');
%   eachLine = textscan(fid,'%s','Delimiter','\n'); fclose(fid);
%   C = csvmixed(eachLine{1},',','"')

assert(ischar(delim) && numel(delim)==1);
assert(ischar(textQualifier) && numel(textQualifier)==1);

% find strings, as specified by the input qualifier
patternStr = sprintf('"([^"]*)"%c?',delim);
patternStr = strrep(patternStr,'"',textQualifier);
Cstr = regexp(eachLine,patternStr,'tokens');

% find numeric data
patternNum = sprintf('(?<=(,|^))[^%c,a-zA-Z]*(?=(,|$))',textQualifier);
patternNum = strrep(patternNum,',',delim);
Cnum = regexp(eachLine,patternNum,'match','emptymatch');

numCols = cellfun(@numel,Cstr) + cellfun(@numel,Cnum);
assert(nnz(diff(numCols))==0,'Number of columns not consistent.')

% get string extents (begin, end) indexes for each string
strExtents = regexp(eachLine,patternStr,'tokenExtents');

% deal out parsed data for each line
C = cell(numel(eachLine),numCols(1));
for ii = 1:numel(eachLine),
    strBounds = vertcat(strExtents{ii}{:});
    delimLocs = getDelimLocs(eachLine{ii},strBounds,delim);
    strCellMap = getCellMap(strBounds,delimLocs);

    C(ii,strCellMap) = [Cstr{ii}{:}]; % quoted text fields (C is preallocated above)
    C(ii,~strCellMap) = num2cell(str2double(Cnum{ii})); % all else must be numeric
end

end

function delimLocs = getDelimLocs(lineText,solidBounds,delim)
    delimCharLocs = strfind(lineText,delim);
    delimLocs = delimCharLocs(~any(bsxfun(@ge,delimCharLocs,solidBounds(:,1)) & ...
        bsxfun(@le,delimCharLocs,solidBounds(:,2)),1));
end

function cellMap = getCellMap(typeBounds,delimLocs)
    cellMap = any(bsxfun(@gt,typeBounds(:,1),[0 delimLocs]) & ...
        bsxfun(@lt,typeBounds(:,1),[delimLocs Inf]), 1);
end

Update: Fixed a small typo in getDelimLocs. Added preallocation of the cell array.
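On the textscan part of the question: a minimal sketch, assuming every column is quoted text and the number of columns is known in advance (two here). The `%q` conversion of textscan reads a double-quoted field as a single value and drops the surrounding quotes, so commas inside the quotes are not treated as delimiters. The file name and the two-column layout are made up for illustration; the field values are the ones from the question.

```matlab
% Hypothetical sample file, written only so the sketch is self-contained.
fname = fullfile(tempdir, 'quoted_sample.csv');
fid = fopen(fname, 'w');
fprintf(fid, '"FAIRHOPE, Alabama","Daphne-Fairhope-Foley, AL"\n');
fprintf(fid, '"FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Daphne-Fairhope-Foley, AL"\n');
fclose(fid);

% One %q per column: quoted fields are read whole, without the quotes.
fid = fopen(fname, 'r');
C = textscan(fid, '%q %q', 'Delimiter', ',');
fclose(fid);

% textscan returns one cell per column; concatenate into a rows-by-columns cell array.
data = [C{:}];
disp(data{1, 2})
```

For a file with more columns, the format needs one conversion per column (for example `%f` for a numeric column), so this only fits files whose layout is known up front.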

Answer 2

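Use the File Exchange function replaceinfile to replace the commas inside the quoted strings with periods. Use read_mixed_csv from Import CSV file with mixed data types to read the file. Then strip the leftover quotes from the resulting strings.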

replaceinfile(', ', '. ', fname); % Replace each comma followed by a space (the commas inside quotes, which are not separators) with a period so they do not start a new column
thisdata = read_mixed_csv(fname, ','); % Reads in the CSV file (\t for tab)
thisdata = regexprep(thisdata, '^"|"$',''); % Remove the leading and trailing quotes from each field

For the replaceinfile.m function: to run the code on Linux, change the first line of the Perl section to

perlCmd = sprintf('"%s"', '/usr/bin/perl');
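If replaceinfile or Perl is not available, the substitution step can be approximated with built-in functions. This is only a sketch of the same idea, not the File Exchange implementation, and the file name is hypothetical. Like the call above, it replaces every comma that is followed by a space, so it relies on the real separators having no space after them.

```matlab
% Hypothetical input file, created only for this sketch.
fname = fullfile(tempdir, 'comma_sample.csv');
fid = fopen(fname, 'w');
fprintf(fid, '"FAIRHOPE, Alabama","Daphne-Fairhope-Foley, AL"\n');
fclose(fid);

% Read the whole file, swap ", " for ". ", and write it back.
txt = fileread(fname);
txt = strrep(txt, ', ', '. ');
fid = fopen(fname, 'w');
fprintf(fid, '%s', txt);
fclose(fid);

fixed = fileread(fname);   % file contents after the substitution
disp(fixed)
```

Because the change is lossy (the original commas are gone from the text), keeping a copy of the input file is a reasonable precaution.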

Importing a mixed CSV that contains text in quotes
