MATLAB读取UNICODE CSV,字符之间有空格

时间:2016-04-20 13:52:01

标签: matlab csv file-io unicode text-parsing

我正在使用fgetl命令来读取.csv文件,而不是返回我想要的结果:

"HIST",1,1,27,PWH,"1"

它返回时每个字符之间有额外的空格:

" H I S T         " , 1 , 1 , 2 7 , P W H , " 1 " 

我知道我可以用regexprep替换空格,但是我的文件包含数十亿行,因此添加的表达式可能会消耗相当多的时间。我有一种感觉,这是一个unicode问题,有人在使用Java时指出了同样的问题,而且它与unicode有关。我想知道是否有人知道更好的方法来处理MATLAB中的问题?

更新

它应该是unicode问题,因为.csv文件是来自另一个程序的输出,当我使用fgetl读取它时,会添加空格。但是,如果我再次使用Excel保存.csv文件并再次使用.csv读取fgetl文件,则会返回我想要的结果。

我无法提供示例,因为.csv文件非常大,我无法制作小样本,因为当我打开并从Excel保存时,这个问题就消失了。

2 个答案:

答案 0 :(得分:0)

出于演示目的,我们考虑一个演示文件 - demo.csv

"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"

你有一些选择:

  • textscan(对于具有已知结构的任何文本文件):

    fID = fopen('demo.csv');
    C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
    fclose(fID);
    

    结果是:

    C = 
    
    {3x1 cell}    [3x1 int32]    [3x1 int32]    [3x1 int32]    {3x1 cell}    {3x1 cell}
    
  • 导入助手+生成脚本(AKA overkill是轻描淡写):

Screenshot of the import helper in MATLAB R2016a.

结果是:

%% Import data from text file.
% Script for importing data from the following text file:
%
%    F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.

% Auto-generated by MATLAB on 2016/04/20 19:51:32

%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';

%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';

%% Open the text file.
fileID = fopen(filename,'r');

%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter,  'ReturnOnError', false);

%% Close the text file.
fclose(fileID);

%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
  raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));

for col=[2,3,4,6]
  % Converts strings in the input cell array to numbers. Replaced non-numeric strings with
  % NaN.
  rawData = dataArray{col};
  for row=1:size(rawData, 1);
    % Create a regular expression to detect and remove non-numeric prefixes and suffixes.
    regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
    try
      result = regexp(rawData{row}, regexstr, 'names');
      numbers = result.numbers;

      % Detected commas in non-thousand locations.
      invalidThousandsSeparator = false;
      if any(numbers==',');
        thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
        if isempty(regexp(numbers, thousandsRegExp, 'once'));
          numbers = NaN;
          invalidThousandsSeparator = true;
        end
      end
      % Convert numeric strings to numbers.
      if ~invalidThousandsSeparator;
        numbers = textscan(strrep(numbers, ',', ''), '%f');
        numericData(row, col) = numbers{1};
        raw{row, col} = numbers{1};
      end
    catch me
    end
  end
end


%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);


%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));


%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;
  • csvread(仅限数值;这意味着此处不适用)。

答案 1 :(得分:0)

我碰巧有同样的问题。我使用.csv打开了一个textscan文件,它在任何字符的两边都添加了1个空格,我还注意到在打开存储读取数据的变量时,字体与Matlab中的常用字体不同。

我们设法通过打开&#39; .csv&#39;来解决这个问题。将文件存入Notepad ++并将编码更改为UTF-8。它解决了这个问题。

希望它有所帮助!