Question

I'm using fgetl to read a .csv file, but instead of returning what I expect:

"HIST",1,1,27,PWH,"1"

it returns the line with an extra space between every character:

" H I S T         " , 1 , 1 , 2 7 , P W H , " 1 "

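I know I could strip the spaces with regexprep, but my file contains billions of lines, so the extra call could take a considerable amount of time. I have a feeling this is a Unicode problem: someone reported the same symptom when using Java, and there it turned out to be related to Unicode. Does anyone know a better way to handle this in MATLAB?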

Update

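It must be a Unicode problem: the .csv file is the output of another program, and the spaces appear when I read it with fgetl. However, if I re-save the .csv file from Excel and read it again with fgetl, I get the result I want.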

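I can't provide an example because the .csv file is very large, and I can't make a small sample because the problem disappears as soon as I open the file in Excel and save it.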

Answer 1

For demonstration purposes, consider a demo file, demo.csv:

"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"

You have a few options:

textscan (for any text file with a known structure):

fID = fopen('demo.csv');
C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
fclose(fID);

The result is:

C = 

{3x1 cell}    [3x1 int32]    [3x1 int32]    [3x1 int32]    {3x1 cell}    {3x1 cell}

Import Tool + generated script (calling this overkill is an understatement):

The result is:

%% Import data from text file.
% Script for importing data from the following text file:
%
%    F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.

% Auto-generated by MATLAB on 2016/04/20 19:51:32

%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';

%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';

%% Open the text file.
fileID = fopen(filename,'r');

%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter,  'ReturnOnError', false);

%% Close the text file.
fclose(fileID);

%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
  raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));

for col=[2,3,4,6]
  % Converts strings in the input cell array to numbers. Replaced non-numeric strings with
  % NaN.
  rawData = dataArray{col};
  for row=1:size(rawData, 1);
    % Create a regular expression to detect and remove non-numeric prefixes and suffixes.
    regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
    try
      result = regexp(rawData{row}, regexstr, 'names');
      numbers = result.numbers;

      % Detected commas in non-thousand locations.
      invalidThousandsSeparator = false;
      if any(numbers==',');
        thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
        if isempty(regexp(numbers, thousandsRegExp, 'once'));
          numbers = NaN;
          invalidThousandsSeparator = true;
        end
      end
      % Convert numeric strings to numbers.
      if ~invalidThousandsSeparator;
        numbers = textscan(strrep(numbers, ',', ''), '%f');
        numericData(row, col) = numbers{1};
        raw{row, col} = numbers{1};
      end
    catch me
    end
  end
end


%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);


%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));


%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;

csvread (numeric data only, which means it does not apply here).
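
As a follow-up to the textscan option above, here is a minimal sketch of how the returned cell array might be used. It assumes the same three demo rows, writes them to a temporary file (the path and the variable names demoFile, names and counts are made up for illustration), and uses the %q conversion instead of treating the double quote as a delimiter. That is a variation on the call shown earlier, not the original answer's exact method.

```matlab
% Write the three demo rows to a temporary file (hypothetical path, illustration only).
demoFile = fullfile(tempdir, 'demo.csv');
fid = fopen(demoFile, 'w');
fprintf(fid, '"GIST",1,6,17,PWH,"1"\n"FIST",0,4,72,WPH,"2"\n"MIST",3,2,27,WHP,"3"\n');
fclose(fid);

% %q reads a field and strips enclosing double quotes; %d produces int32 columns.
fid = fopen(demoFile, 'r');
C = textscan(fid, '%q%d%d%d%s%q', 'Delimiter', ',');
fclose(fid);

% Each format specifier maps to one cell of C, one row per line of the file.
names  = C{1};   % cell array of character vectors, quotes removed
counts = C{2};   % int32 column vector taken from the second field
```

Note that this only addresses parsing. If the file is stored in a multi-byte encoding, the extra characters will most likely still show up until the encoding itself is dealt with.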

Answer 2

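I happened to have the same problem. I opened a .csv file with textscan, and it added one space on either side of every character. I also noticed that when I opened the variable holding the data that was read, the font was different from the one MATLAB normally uses.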

We managed to solve this by opening the '.csv' file in Notepad++ and changing the encoding to UTF-8. That fixed the problem.
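
The same re-encoding can probably be done from MATLAB itself rather than by hand. The sketch below rests on several assumptions: that the original file is UTF-16 with a byte-order mark (a common cause of the "space between every character" symptom, but not confirmed for the file in the question), that the MATLAB installation accepts the encoding names 'UTF-16LE' and 'UTF-16BE' in native2unicode, and that the file fits in memory. The file names are hypothetical, and the sample file is generated only so the snippet can run on its own.

```matlab
% Hypothetical paths, created here only for illustration.
inFile  = fullfile(tempdir, 'demo_utf16.csv');
outFile = fullfile(tempdir, 'demo_utf8.csv');

% Build a small UTF-16LE sample with a byte-order mark (stand-in for the real export).
sample = sprintf('"HIST",1,1,27,PWH,"1"\n');
fid = fopen(inFile, 'w');
fwrite(fid, [255 254], 'uint8');                 % BOM FF FE = UTF-16 little-endian
fwrite(fid, uint16(sample), 'uint16', 0, 'l');   % two bytes per character
fclose(fid);

% Read the raw bytes and inspect the first two for a byte-order mark.
fid = fopen(inFile, 'r');
bytes = fread(fid, Inf, 'uint8=>uint8')';
fclose(fid);

if numel(bytes) >= 2 && isequal(bytes(1:2), uint8([255 254]))
    enc = 'UTF-16LE';  bytes = bytes(3:end);
elseif numel(bytes) >= 2 && isequal(bytes(1:2), uint8([254 255]))
    enc = 'UTF-16BE';  bytes = bytes(3:end);
else
    enc = 'UTF-8';     % assumption: no BOM means the file is already UTF-8
end

% Decode to MATLAB characters, then write the text back out as UTF-8.
txt = native2unicode(bytes, enc);
fid = fopen(outFile, 'w');
fwrite(fid, unicode2native(txt, 'UTF-8'), 'uint8');
fclose(fid);

% fgetl on the converted file should now return the line without padding.
fid = fopen(outFile, 'r');
firstLine = fgetl(fid);
fclose(fid);
```

For a file with billions of lines, reading everything at once is unlikely to be practical. The same decoding step could be applied to chunks with an even number of bytes, so that no two-byte character is split across chunks.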

Hope it helps!

MATLAB reading a Unicode CSV returns spaces between characters
