MATLAB - Help using textscan: how do I ignore comments and header columns?

Asked: 2013-09-13 17:46:45

Tags: matlab textscan textreader tab-delimited-text

I need help using textscan. I am trying to read data in the following format:

# ---------------------------------- WARNING ----------------------------------------
# The data you have obtained from this automated U.S. Geological Survey database
# have not received Director's approval and as such are provisional and subject to
# revision.  The data are released on the condition that neither the USGS nor the
# United States Government may be held liable for any damages resulting from its use.
# Additional info: http://nwis.waterdata.usgs.gov/nwis/help/?provisional
#
# File-format description:  http://nwis.waterdata.usgs.gov/nwis/?tab_delimited_format_info
# Automated-retrieval info: http://nwis.waterdata.usgs.gov/nwis/?automated_retrieval_info
#
# Contact:   gs-w_support_nwisweb@usgs.gov
# retrieved: 2013-09-13 13:10:29 EDT       (nadww01)
#
# Data for the following 1 site(s) are contained in this file
#    USGS 08067074 CWA Canal at Thompson Rd nr Baytown, TX
# -----------------------------------------------------------------------------------
#
# Data provided for site 08067074
#    DD parameter   Description
#    01   00010     Temperature, water, degrees Celsius
#    02   00095     Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius
#
# Data-value qualification codes included in this output: 
#     A  Approved for publication -- Processing and review completed.  
#     P  Provisional data subject to revision.  
# 
agency_cd   site_no datetime    tz_cd   01_00010    01_00010_cd 02_00095    02_00095_cd
5s  15s 20d 6s  14n 10s 14n 10s
USGS    08067074    2013-01-05 00:00    CST 10.3    A   391 A
USGS    08067074    2013-01-05 00:15    CST 10.3    A   391 A
USGS    08067074    2013-01-05 00:30    CST 10.3    A   391 A
USGS    08067074    2013-01-05 00:45    CST 10.3    A   391 A
USGS    08067074    2013-01-05 01:00    CST 10.3    A   391 A
USGS    08067074    2013-01-05 01:15    CST 10.3    A   391 A
USGS    08067074    2013-01-05 01:30    CST 10.3    A   391 A
USGS    08067074    2013-01-05 01:45    CST 10.3    A   391 A
USGS    08067074    2013-01-05 02:00    CST 10.3    A   391 A
USGS    08067074    2013-01-05 02:15    CST 10.3    A   391 A
USGS    08067074    2013-01-05 02:30    CST 10.3    A   391 A
USGS    08067074    2013-01-05 02:45    CST 10.2    A   391 A
USGS    08067074    2013-01-05 03:00    CST 10.2    A   391 A
USGS    08067074    2013-01-05 03:15    CST 10.2    A   391 A
USGS    08067074    2013-01-05 03:30    CST 10.2    A   391 A
USGS    08067074    2013-01-05 03:45    CST 10.2    A   391 A
USGS    08067074    2013-01-05 04:00    CST 10.2    A   391 A
USGS    08067074    2013-01-05 04:15    CST 10.2    A   392 A
USGS    08067074    2013-01-05 04:30    CST 10.2    A   391 A
USGS    08067074    2013-01-05 04:45    CST 10.2    A   391 A
USGS    08067074    2013-01-05 05:00    CST 10.2    A   391 A
USGS    08067074    2013-01-05 05:15    CST 10.2    A   391 A
USGS    08067074    2013-01-05 05:30    CST 10.2    A   391 A
USGS    08067074    2013-01-05 05:45    CST 10.2    A   391 A
USGS    08067074    2013-01-05 06:00    CST 10.2    A   391 A
USGS    08067074    2013-01-05 06:15    CST 10.1    A   391 A
USGS    08067074    2013-01-05 06:30    CST 10.1    A   391 A
USGS    08067074    2013-01-05 06:45    CST 10.1    A   391 A
USGS    08067074    2013-01-05 07:00    CST 10.1    A   391 A
USGS    08067074    2013-01-05 07:15    CST 10.1    A   391 A
USGS    08067074    2013-01-05 07:30    CST 10.1    A   390 A
USGS    08067074    2013-01-05 07:45    CST 10.0    A   391 A
USGS    08067074    2013-01-05 08:00    CST 10.0    A   390 A
USGS    08067074    2013-01-05 08:15    CST 10.0    A   391 A
USGS    08067074    2013-01-05 08:30    CST 10.0    A   391 A
USGS    08067074    2013-01-05 08:45    CST 10.0    A   390 A
USGS    08067074    2013-01-05 09:00    CST 10.0    A   390 A
USGS    08067074    2013-01-05 09:15    CST 10  A   390 A
USGS    08067074    2013-01-05 09:30    CST 10  A   390 A
USGS    08067074    2013-01-05 09:45    CST 10  A   390 A
USGS    08067074    2013-01-05 10:00    CST 10  A   390 A
USGS    08067074    2013-01-05 10:15    CST 10  A   390 A
USGS    08067074    2013-01-05 10:30    CST 10  A   390 A
USGS    08067074    2013-01-05 10:45    CST 10  A   390 A
USGS    08067074    2013-01-05 11:00    CST 10  A   390 A
USGS    08067074    2013-01-05 11:15    CST 10  A   390 A
USGS    08067074    2013-01-05 11:30    CST 10  A   390 A
USGS    08067074    2013-01-05 11:45    CST 10  A   389 A
USGS    08067074    2013-01-05 12:00    CST 10  A   389 A
USGS    08067074    2013-01-05 12:15    CST 10  A   389 A
USGS    08067074    2013-01-05 12:30    CST 10  A   389 A
USGS    08067074    2013-01-05 12:45    CST 10  A   389 A
USGS    08067074    2013-01-05 13:00    CST 10  A   389 A
USGS    08067074    2013-01-05 13:15    CST 10  A   389 A
USGS    08067074    2013-01-05 13:30    CST 10  A   389 A

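The only two data entries I care about are the date and the specific conductance (columns 3 and 7, respectively).

I can do this reliably with the following code: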

%% 
% Collect conductance data
filename = 'conductivityData_Temp_File';


%%
% Determine the number of lines in the data file
fid = fopen(filename, 'r');
fseek(fid, 0, 'eof');
chunksize = ftell(fid);          % file size in bytes
fseek(fid, 0, 'bof');
ch = fread(fid, chunksize, '*uchar');
N = sum(ch == sprintf('\n'));    % number of lines
fclose(fid);

%% 
% Read conductivity data
fileconductID = fopen(filename);
waterConductivityData = textscan(fileconductID, '%s %d %s %s %f %s %f %s', N, 'delimiter', '\t', 'EmptyValue', 0, 'headerlines', 27);
fclose(fileconductID);

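However, I found out that you can simply use 'CommentStyle' to ignore the comments. This matters because I am reading multiple files, and occasionally I run into a file that does not have exactly 27 comment lines, which makes my program throw an error.

Can someone show me how to adjust my textscan code so that it ignores the comment lines and skips the two header lines?

I apologize if the example code I provided is convoluted, but basically my error lies in this one line: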

waterConductivityData = textscan(fileconductID, '%s %d %s %s %f %s %f %s', N, 'delimiter', '\t', 'EmptyValue', 0, 'headerlines', 27);

(If you would like to try it, a sample tab-delimited file is linked here.)

Thanks!

Solution:


Thanks TryHard, that is a good approach, but I wanted to stay closer to what I was doing before. Apparently my delimiter was off.

waterConductivityData = textscan(fileconductID,'%s %s %s %s %s %s %s %s %s ' , 'Delimiter', '\t', 'CommentStyle', '#');

dates = waterConductivityData{3}(3:end);
conductancesStr = waterConductivityData{7}(3:end);
temperaturesStr = waterConductivityData{5}(3:end);

conductances = str2double(conductancesStr);
temperatures = str2double(temperaturesStr);
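
As a follow-up, the date strings can be turned into numeric timestamps in the same way the readings are turned into doubles. This is only a minimal sketch: the sample values are hypothetical stand-ins for waterConductivityData{3}(3:end) and waterConductivityData{7}(3:end), and the format string 'yyyy-mm-dd HH:MM' is assumed from the sample rows shown in the question.

```matlab
% Hypothetical sample values standing in for the columns read by textscan
dates = {'2013-01-05 00:00'; '2013-01-05 00:15'; '2013-01-05 00:30'};
conductancesStr = {'391'; '391'; '392'};

% Convert the date strings to serial date numbers (units are days)
timestamps = datenum(dates, 'yyyy-mm-dd HH:MM');

% Convert the readings to doubles; non-numeric entries would become NaN
conductances = str2double(conductancesStr);

% Spacing between consecutive samples, in minutes
stepMinutes = diff(timestamps) * 24 * 60;
```

With numeric timestamps, something like plot(timestamps, conductances) followed by datetick('x') should give a time-series plot.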

2 Answers:

Answer 0 (score: 1)

Change your textscan line to:

waterConductivityData = textscan(fileconductID, '%s %d %s %s %f %s %f %s', N, 'Delimiter', '\t', 'EmptyValue', 0, 'CommentStyle', '#');

Then pick out the columns you want:

dates = waterConductivityData{3}(3:end)
conductances = waterConductivityData{7}(3:end)
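
Note that the two header rows (column names and field widths) are plain text, so numeric specifiers such as %d or %f may not be able to parse them. One alternative, sketched here under assumptions, is to consume the '#' lines and the two header rows with fgetl first and only then hand the file to textscan. The temporary file and its contents are hypothetical stand-ins modeled on the sample in the question, and site_no is read as %s here so that its leading zero is preserved.

```matlab
% Build a small stand-in file (hypothetical contents modeled on the question)
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '# comment line\n#\n');
fprintf(fid, 'agency_cd\tsite_no\tdatetime\ttz_cd\t01_00010\t01_00010_cd\t02_00095\t02_00095_cd\n');
fprintf(fid, '5s\t15s\t20d\t6s\t14n\t10s\t14n\t10s\n');
fprintf(fid, 'USGS\t08067074\t2013-01-05 00:00\tCST\t10.3\tA\t391\tA\n');
fprintf(fid, 'USGS\t08067074\t2013-01-05 04:15\tCST\t10.2\tA\t392\tA\n');
fclose(fid);

fid = fopen(filename, 'r');

% Skip however many comment lines the file happens to have
tline = fgetl(fid);
while ischar(tline) && strncmp(tline, '#', 1)
    tline = fgetl(fid);
end
% tline now holds the column-name row; the next call consumes the width row
widthRow = fgetl(fid);

% The file position is at the first data row, so numeric specifiers are safe
data = textscan(fid, '%s %s %s %s %f %s %f %s', 'Delimiter', '\t');
fclose(fid);
delete(filename);

dates = data{3};
conductances = data{7};
```

Because the header rows are consumed before textscan runs, no (3:end) indexing is needed in this variant.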

Answer 1 (score: 0)

One way to get around the variable header length is to parse the file as follows:

fid = fopen(file);
str = textscan(fid, '%s');   % split the whole file into whitespace-separated tokens
fclose(fid);

str2=strvcat(str{1});
fst=strmatch('CST',str2);

dtstr = str2(fst(1)-2:9:end,:);   % date strings
timstr = str2(fst(1)-1:9:end,:);  % time strings
condctv = str2(fst(1)+3:9:end,:); % conductivity string

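This finds the first occurrence of the string "CST" and parses from there, on the assumption that the data rows are organized the same way in every file. It also requires that "CST" occurs in the first data row of the table. If that is not constant across your data files, the idea breaks down. You could, however, use other strings to locate where the data table starts, provided they are unique and always appear in the same position. The following uses the last format specifier in the field-width row instead: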

str2=strvcat(str{1});
fst=strmatch('10s',str2);
fst=fst(end);

dtstr = str2(fst+3:9:end,:);
timstr = str2(fst+4:9:end,:);
condctv = str2(fst+8:9:end,:);

You can convert the strings in condctv to numeric data with str2num, as follows:

condctv = str2num(condctv);
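
To see where the stride of 9 and the offsets come from, it may help to tokenize a single data row on whitespace. This is a small sketch using one row copied from the sample in the question, with the tabs written out via sprintf. The date and time become two separate tokens, so an 8-column row yields 9 tokens.

```matlab
% One data row from the sample file, with explicit tab separators
row = sprintf('USGS\t08067074\t2013-01-05 00:00\tCST\t10.3\tA\t391\tA');

% Default textscan splitting is on whitespace, so spaces and tabs both split
tokens = textscan(row, '%s');
tokens = tokens{1};

numTokens = numel(tokens);                  % tokens per data row
cstIdx = find(strcmp(tokens, 'CST'), 1);    % position of 'CST' within the row

dateTok = tokens{cstIdx - 2};               % date token
timeTok = tokens{cstIdx - 1};               % time token
condctv = str2double(tokens{cstIdx + 3});   % conductivity as a double
```

Here str2double is used on a single token. For a char matrix such as the one strvcat produces, str2double(cellstr(condctv)) is one possible alternative to str2num, which evaluates its input as code.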