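I have a file that contains several subgroups of codes and descriptions. I need to parse each section that starts with "VALUE" until I see a semicolon ";".
The file is a .sas file. The VALUE statement tells me the type of reference data, and all the lines below it are instances until a semicolon marks the end of the group. I have written something that works, but for an old Java developer it is very ugly procedurally. I'm sure there is a more efficient Perl way to solve this. By efficient, I mean that I am currently brute-forcing the validation with a lot of IF/ELSE statements.
Here is an excerpt of the .sas file I am parsing (note: this is an incomplete file, but it contains the use cases):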
*********************************************************************
MARCH 20, 2018 2:05 PM
This is an example of a SAS program that creates a SAS
file from the 2017 NHIS Public Use HOUSEHLD.DAT ASCII file
This is stored in HOUSEHLD.SAS
*********************************************************************;
* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS;
LIBNAME NHIS "C:\NHIS2017";
* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS
IF YOU ALLOW PROGRAM TO PERMANENTLY STORE FORMATS;
LIBNAME LIBRARY "C:\NHIS2017";
FILENAME ASCIIDAT 'C:\NHIS2017\HOUSEHLD.dat';
* DEFINE VARIABLE VALUES FOR REPORTS;
* USE THE STATEMENT "PROC FORMAT LIBRARY=LIBRARY"
TO PERMANENTLY STORE THE FORMAT DEFINITIONS;
* USE THE STATEMENT "PROC FORMAT" IF YOU DO NOT WISH
TO PERMANENTLY STORE THE FORMATS;
PROC FORMAT LIBRARY=LIBRARY;
*PROC FORMAT;
VALUE $GROUPC
' '< - HIGH = "Range of Values"
;
VALUE GROUPN
LOW - HIGH = "Range of Values"
;
VALUE HHP001X
10 = "10 Household"
20 = "20 Person"
25 = "25 Income Imputation"
30 = "30 Sample Adult"
38 = "38 Functioning and Disability"
40 = "40 Sample Child"
60 = "60 Family"
63 = "63 Family Disability Questions"
65 = "65 Paradata"
70 = "70 Injury/Poisoning Episode"
75 = "75 Injury/Poisoning Verbatim"
;
VALUE HHP008X
01 = "01 House, apartment, flat, condo"
02 = "02 HU in nontransient hotel, motel"
03 = "03 HU-permanent in transient hotel, motel"
04 = "04 HU in rooming house"
05 = "05 Mobile home/trailer w/no permanent rooms added"
06 = "06 Mobile home/trailer w/1+ permanent rooms added"
07 = "07 HU not specified above"
08 = "08 Quarters not HU in room or board house"
09 = "09 Unit not permanent-transient hotel, motel"
10 = "10 Unoccupied site for mobile home/trailer/tent"
11 = "11 Student quarters in college dormitory"
12 = "12 Group quarter unit not specified above"
98 = "98 Not ascertained"
;
VALUE HHP009X
1 = "1 Refused"
2 = "2 No one home - repeated calls"
3 = "3 Temporarily absent"
4 = "4 Language problem"
5 = "5 Other"
;
VALUE HHP015X
1 = "1 Northeast"
2 = "2 Midwest"
3 = "3 South"
4 = "4 West"
;
DATA NHIS.HOUSEHLD;
* CREATE A SAS DATA SET;
INFILE ASCIIDAT PAD LRECL=47;
Here is my script:
#!/usr/bin/perl
# This script looks through a file for the word "VALUE"
# If it finds the word, it will identify the value type and
# then process code/description rows until it finds a semi-colon.
# A semi-colon resets a new search for a value type to begin
use strict;
use warnings;
use diagnostics;
my $file = 'HOUSEHLD.sas';
my $cnt = 0;
my $i = 0;
my $size = 0;
my $valgrp = "";
# 'or' (not '||') so that die applies to open rather than to $file
open my $fh, '<', $file or die "Could not open $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    $cnt = ($line =~ s/(VALUE )/$1/g);
    $line =~ s/^\s+|\s+$//g; # strip leading and trailing spaces
    # does the line contain only one instance of 'VALUE'
    # check if we are in a reference value group
    if ($valgrp eq "t") {
        my @refval = split("=", $line);
        if ($line ne ";") {
            print "code: $refval[0]";
            print " description: $refval[1]\n";
        }
        # when you see a semi-colon you are at the end of the reference block
        elsif ($refval[0] eq ";") {
            $valgrp = "f";
        }
    }
    if ($cnt == 1) {
        my @row = split(" ", $line);
        if ( $row[0] eq "VALUE" && scalar(@row) == 2 ) {
            print "code type: $row[1]\n";
            $valgrp = "t";
        }
    }
}
close ($fh);
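Here is the expected (but not final) output. I will create a .csv file, or load the data directly into MySQL tables created per VALUE type. The first two VALUE types are not valid, but they are present in the file as I process it. I don't know whether $GROUPC and GROUPN are always the first two, or whether I should code some kind of ignore rule.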
code type: $GROUPC
code: ' '< - HIGH description: "Range of Values"
code type: GROUPN
code: LOW - HIGH description: "Range of Values"
code type: HHP001X
code: 10 description: "10 Household"
code: 20 description: "20 Person"
code: 25 description: "25 Income Imputation"
code: 30 description: "30 Sample Adult"
code: 38 description: "38 Functioning and Disability"
code: 40 description: "40 Sample Child"
code: 60 description: "60 Family"
code: 63 description: "63 Family Disability Questions"
code: 65 description: "65 Paradata"
code: 70 description: "70 Injury/Poisoning Episode"
code: 75 description: "75 Injury/Poisoning Verbatim"
code type: HHP002X
code: . description: '.'
code: OTHER description: "Survey Year"
Answer 0 (score: 2)
Here is an approach similar to yours, simplified and cleaned up. It works fine.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (%data, $group_val, $in_group);
while (<$fh>)
{
    if (/^\s*VALUE\s*(.*)/) {
        $group_val = $1;
        $in_group = 1;
        next;
    }
    elsif (/^\s*;\s*$/) {
        $in_group = 0;
    }
    next if not $in_group;
    my @refval = map { s/^\s+|\s+$//gr } split /\s*=\s*/;
    push @{$data{$group_val}}, \@refval;
    #say "$group_val: @refval";
}
dd \%data;
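I print the data with Data::Dump; format the output as needed. The output is as expected: the value for the key HHP001X is an arrayref whose elements are arrayrefs ([10, '10 Household'], ...), and so on. (I can't tell what the expected OTHER entry is; the expected output is approximate, and Survey... does not appear in the sample.)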
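I store the data in a hash so that the VALUE names can be used as keys. If their order in the file needs to be preserved, we also need to record that order (in an array) so that the hash can be traversed in file order, or else store the data in an array (of arrayrefs) instead of a hash.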
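A minimal sketch of the order-preserving variant described above, assuming Perl 5.14 or later (for the /r modifier). The helper name parse_value_groups and the @sample_lines data are made up for illustration and are not part of the answer. It keeps the hash for lookup and adds an array that records the order in which the groups first appear:

```perl
use strict;
use warnings;

# Hypothetical helper: parse VALUE blocks from a list of lines and
# remember the order in which the group names were first seen.
sub parse_value_groups {
    my @lines = @_;
    my (%data, @order, $group, $in_group);
    for my $line (@lines) {
        if ($line =~ /^\s*VALUE\s+(\S+)/) {
            $group    = $1;
            $in_group = 1;
            push @order, $group unless exists $data{$group};
            $data{$group} //= [];
            next;
        }
        if ($line =~ /^\s*;\s*$/) {      # a lone semicolon closes the block
            $in_group = 0;
            next;
        }
        next unless $in_group;
        # Split on the first '=' only, so a description containing '=' stays whole
        my ($code, $desc) = map { s/^\s+|\s+$//gr } split /=/, $line, 2;
        next unless defined $desc;       # skip lines without a code = description pair
        push @{ $data{$group} }, [ $code, $desc ];
    }
    return (\%data, \@order);
}

# Assumed sample input, shortened from the file in the question
my @sample_lines = (
    'PROC FORMAT LIBRARY=LIBRARY;',
    '   VALUE HHP015X',
    '      1 = "1 Northeast"',
    '      2 = "2 Midwest"',
    '   ;',
    '   VALUE HHP009X',
    '      1 = "1 Refused"',
    '   ;',
);

my ($data, $order) = parse_value_groups(@sample_lines);
for my $name (@$order) {
    print "code type: $name\n";
    print "code: $_->[0] description: $_->[1]\n" for @{ $data->{$name} };
}
```

Iterating over @$order rather than keys %$data is what keeps the groups in file order; unwanted groups such as $GROUPC or GROUPN could be filtered out of that array before printing.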
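Answer 1 (score: 1)
The range operator (..) is useful here.
This example just prints the lines in blocks that start with VALUE and end with a semicolon, to give you a starting point: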
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
while (<>) {
    chomp;
    my $match = /^\s*VALUE (\w+)/ .. /^\s*;$/;
    if ($match ne "" && $match == 1) {
        say "Code type: $1";
    } elsif ($match !~ /^$|E0/) {
        say $_; # to-do: print out in your 'code: XX description: YY' format
    }
}
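It uses the return value of the scalar range operator to determine whether the current line is the VALUE line or the terminating semicolon:
The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.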
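As one possible way to fill in the to-do from the example, the following sketch uses the sequence number to pick out the VALUE line, skips the line tagged with "E0", and formats everything in between. The helper name format_value_blocks and the @demo_lines data are assumptions for illustration only, and the /r modifier requires Perl 5.14 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: turn VALUE blocks into the report lines from the question.
sub format_value_blocks {
    my @out;
    for (@_) {
        # Scalar-context range operator: '' outside a block, 1, 2, ... inside,
        # and "<n>E0" on the line that matches the right-hand pattern.
        my $seq = /^\s*VALUE\s+(\S+)/ .. /^\s*;\s*$/;
        next if $seq eq '';              # not inside any VALUE block
        if ($seq == 1) {                 # the VALUE statement itself
            push @out, "code type: $1";
        }
        elsif ($seq !~ /E0$/) {          # body line, not the closing semicolon
            my ($code, $desc) = map { s/^\s+|\s+$//gr } split /=/, $_, 2;
            push @out, "code: $code description: $desc" if defined $desc;
        }
    }
    return @out;
}

# Assumed sample input, shortened from the file in the question
my @demo_lines = (
    'PROC FORMAT LIBRARY=LIBRARY;',
    '   VALUE HHP015X',
    '      1 = "1 Northeast"',
    '      2 = "2 Midwest"',
    '   ;',
    'DATA NHIS.HOUSEHLD;',
);

my @report = format_value_blocks(@demo_lines);
print "$_\n" for @report;
```

The empty-string check comes first so that the numeric comparison never sees a non-numeric value; the "E0" suffix leaves the numeric value unchanged, so the closing line has to be detected with a pattern match rather than with ==.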