我正在使用fgetl
命令来读取.csv
文件,而不是返回我想要的结果:
"HIST",1,1,27,PWH,"1"
它返回时每个字符之间有额外的空格:
" H I S T " , 1 , 1 , 2 7 , P W H , " 1 "
我知道我可以用regexprep
替换空格,但是我的文件包含数十亿行,因此添加的表达式可能会消耗相当多的时间。我有一种感觉,这是一个unicode问题,有人在使用Java时指出了同样的问题,而且它与unicode有关。我想知道是否有人知道更好的方法来处理MATLAB中的问题?
它应该是unicode问题,因为.csv
文件是来自另一个程序的输出,当我使用fgetl
读取它时,会添加空格。但是,如果我再次使用Excel保存.csv
文件并再次使用.csv
读取fgetl
文件,则会返回我想要的结果。
我无法提供示例,因为.csv
文件非常大,我无法制作小样本,因为当我打开并从Excel保存时,这个问题就消失了。
答案 0 :(得分:0)
出于演示目的,我们考虑一个演示文件 - demo.csv
:
"GIST",1,6,17,PWH,"1"
"FIST",0,4,72,WPH,"2"
"MIST",3,2,27,WHP,"3"
你有一些选择:
textscan
(对于具有已知结构的任何文本文件):
fID = fopen('demo.csv');
C = textscan(fID,'%s%d%d%d%s%s','Delimiter',{',','"'},'MultipleDelimsAsOne',1);
fclose(fID);
结果是:
C =
{3x1 cell} [3x1 int32] [3x1 int32] [3x1 int32] {3x1 cell} {3x1 cell}
导入助手+生成脚本(AKA overkill是轻描淡写):
结果是:
%% Import data from text file.
% Script for importing data from the following text file:
%
% F:\demo.csv
%
% To extend the code to different selected data or a different text file, generate a
% function instead of a script.
% Auto-generated by MATLAB on 2016/04/20 19:51:32
%% Initialize variables.
filename = 'F:\demo.csv';
delimiter = ',';
%% Read columns of data as strings:
% For more information, see the TEXTSCAN documentation.
formatSpec = '%q%q%q%q%q%q%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r');
%% Read columns of data according to format string.
% This call is based on the structure of the file used to generate this code. If an error
% occurs for a different file, try regenerating the code from the Import Tool.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'ReturnOnError', false);
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric strings to numbers.
% Replace non-numeric strings with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
raw(1:length(dataArray{col}),col) = dataArray{col};
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[2,3,4,6]
% Converts strings in the input cell array to numbers. Replaced non-numeric strings with
% NaN.
rawData = dataArray{col};
for row=1:size(rawData, 1);
% Create a regular expression to detect and remove non-numeric prefixes and suffixes.
regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
try
result = regexp(rawData{row}, regexstr, 'names');
numbers = result.numbers;
% Detected commas in non-thousand locations.
invalidThousandsSeparator = false;
if any(numbers==',');
thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
if isempty(regexp(numbers, thousandsRegExp, 'once'));
numbers = NaN;
invalidThousandsSeparator = true;
end
end
% Convert numeric strings to numbers.
if ~invalidThousandsSeparator;
numbers = textscan(strrep(numbers, ',', ''), '%f');
numericData(row, col) = numbers{1};
raw{row, col} = numbers{1};
end
catch me
end
end
end
%% Split data into numeric and cell columns.
rawNumericColumns = raw(:, [2,3,4,6]);
rawCellColumns = raw(:, [1,5]);
%% Allocate imported array to column variable names
GIST = rawCellColumns(:, 1);
VarName2 = cell2mat(rawNumericColumns(:, 1));
VarName3 = cell2mat(rawNumericColumns(:, 2));
VarName4 = cell2mat(rawNumericColumns(:, 3));
PWH = rawCellColumns(:, 2);
VarName6 = cell2mat(rawNumericColumns(:, 4));
%% Clear temporary variables
clearvars filename delimiter formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp me rawNumericColumns rawCellColumns;
csvread
(仅限数值;这意味着此处不适用)。答案 1 :(得分:0)
我碰巧有同样的问题。我使用.csv
打开了一个textscan
文件,它在任何字符的两边都添加了1个空格,我还注意到在打开存储读取数据的变量时,字体与Matlab中的常用字体不同。
我们设法通过打开&#39; .csv&#39;来解决这个问题。将文件存入Notepad ++并将编码更改为UTF-8。它解决了这个问题。
希望它有所帮助!