Parsing subsections of a file into separate groups

Time: 2019-02-18 23:32:08

Tags: regex perl

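I have a file that contains several subgroups of codes and descriptions. I need to parse each section that begins with "VALUE" up to the terminating semicolon ";".

The file is a .sas file. The VALUE statement tells me the type of reference data, and every line below it is an instance until a semicolon marks the end of the group. I have written something that works, but as an old Java developer I find it procedurally very ugly. I'm sure there is a more efficient Perl way to do this. By "efficient" I mean less brute force: I currently validate everything with many IF/ELSE statements.

Here is an excerpt of the .sas file I am parsing (note: this is an incomplete file, but it contains the use cases):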

*********************************************************************
 MARCH 20, 2018  2:05 PM

 This is an example of a SAS program that creates a SAS
 file from the 2017 NHIS Public Use HOUSEHLD.DAT ASCII file

 This is stored in HOUSEHLD.SAS
*********************************************************************;

* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS;
LIBNAME  NHIS     "C:\NHIS2017";

* USER NOTE: PLACE NEXT STATEMENT IN SUBSEQUENT PROGRAMS
             IF YOU ALLOW PROGRAM TO PERMANENTLY STORE FORMATS;
LIBNAME  LIBRARY  "C:\NHIS2017";

FILENAME ASCIIDAT 'C:\NHIS2017\HOUSEHLD.dat';

* DEFINE VARIABLE VALUES FOR REPORTS;

*  USE THE STATEMENT "PROC FORMAT LIBRARY=LIBRARY"
     TO PERMANENTLY STORE THE FORMAT DEFINITIONS;

*  USE THE STATEMENT "PROC FORMAT" IF YOU DO NOT WISH
      TO PERMANENTLY STORE THE FORMATS;

PROC FORMAT LIBRARY=LIBRARY;
*PROC FORMAT;

   VALUE $GROUPC
      ' '< - HIGH   = "Range of Values"
   ;
   VALUE GROUPN
      LOW - HIGH   = "Range of Values"
   ;
   VALUE HHP001X
      10                 = "10 Household"
      20                 = "20 Person"
      25                 = "25 Income Imputation"
      30                 = "30 Sample Adult"
      38                 = "38 Functioning and Disability"
      40                 = "40 Sample Child"
      60                 = "60 Family"
      63                 = "63 Family Disability Questions"
      65                 = "65 Paradata"
      70                 = "70 Injury/Poisoning Episode"
      75                 = "75 Injury/Poisoning Verbatim"
   ;

   VALUE HHP008X
      01                 = "01 House, apartment, flat, condo"
      02                 = "02 HU in nontransient hotel, motel"
      03                 = "03 HU-permanent in transient hotel, motel"
      04                 = "04 HU in rooming house"
      05                 = "05 Mobile home/trailer w/no permanent rooms added"
      06                 = "06 Mobile home/trailer w/1+ permanent rooms added"
      07                 = "07 HU not specified above"
      08                 = "08 Quarters not HU in room or board house"
      09                 = "09 Unit not permanent-transient hotel, motel"
      10                 = "10 Unoccupied site for mobile home/trailer/tent"
      11                 = "11 Student quarters in college dormitory"
      12                 = "12 Group quarter unit not specified above"
      98                 = "98 Not ascertained"
   ;
   VALUE HHP009X
      1                  = "1 Refused"
      2                  = "2 No one home - repeated calls"
      3                  = "3 Temporarily absent"
      4                  = "4 Language problem"
      5                  = "5 Other"
   ;
   VALUE HHP015X
      1                  = "1 Northeast"
      2                  = "2 Midwest"
      3                  = "3 South"
      4                  = "4 West"
   ;

DATA NHIS.HOUSEHLD;
   * CREATE A SAS DATA SET;
   INFILE ASCIIDAT PAD LRECL=47;

Here is my script:

#!/usr/bin/perl

# This script looks through a file for the word "VALUE"
# If it finds the word, it will identify the value type and
# then process code/description rows until it finds a semi-colon. 
# A semi-colon resets a new search for a value type to begin

use strict;
use warnings;
use diagnostics;

my $file = 'HOUSEHLD.sas';
my $cnt = 0; 
my $i = 0;
my $size = 0;
my $valgrp = "";

# "or" instead of "||": with "||" the die binds to $file and never fires
open my $fh, '<', $file or die "Could not open $file: $!";

while (my $line = <$fh>) { 
    chomp $line; 

    $cnt = ($line =~ s/(VALUE )/$1/g);

    $line =~ s/^\s+|\s+$//g; #strip leading and trailing spaces


    #does the array contain only one instance of 'VALUE'
    #check if we are in a reference value group
    if ($valgrp eq "t") {
        my @refval = split("=", $line); 
        if ($line ne ";" ){
            print "code: $refval[0]";
            print " description: $refval[1]\n";
        }
        # when you see a semi-colon you are at the end of the reference block
        elsif ($refval[0] eq ";") { 
            $valgrp ="f";
        }
    }

    if ($cnt == 1) {

        my @row = split(" ", $line);    

        if ( $row[0] eq "VALUE" && scalar(@row) == 2 ) {
            print "code type: $row[1]\n";
            $valgrp = "t";
        }

    }

}

close ($fh);

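This is the expected (but not final) output. I will either create a .csv file or load this directly into MySQL tables created per VALUE type. The first two VALUE types are not valid, but they are there when I process the file. I don't know whether $GROUPC and GROUPN are always the first two, or whether I should code some kind of ignore for them.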

code type: $GROUPC
code: ' '< - HIGH    description:  "Range of Values"
code type: GROUPN
code: LOW - HIGH    description:  "Range of Values"
code type: HHP001X
code: 10                  description:  "10 Household"
code: 20                  description:  "20 Person"
code: 25                  description:  "25 Income Imputation"
code: 30                  description:  "30 Sample Adult"
code: 38                  description:  "38 Functioning and Disability"
code: 40                  description:  "40 Sample Child"
code: 60                  description:  "60 Family"
code: 63                  description:  "63 Family Disability Questions"
code: 65                  description:  "65 Paradata"
code: 70                  description:  "70 Injury/Poisoning Episode"
code: 75                  description:  "75 Injury/Poisoning Verbatim"
code type: HHP002X
code: .                    description:  '.'
code: OTHER               description:  "Survey Year"

2 answers:

Answer 0: (score: 2)

Here is an approach similar to yours, simplified and cleaned up. It works well.

use warnings;
use strict;
use feature 'say';

use Data::Dump qw(dd);

my $file = shift || die "Usage: $0 file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my (%data, $group_val, $in_group);

while (<$fh>) 
{
    if (/^\s*VALUE\s*(.*)/) {
        $group_val = $1;
        $in_group = 1;
        next;
    }
    elsif (/^\s*;\s*$/) {
        $in_group = 0;
    }    
    next if not $in_group;

    my @refval = map { s/^\s+|\s+$//gr } split /\s*=\s*/;

    push @{$data{$group_val}}, \@refval;

    #say "$group_val: @refval";
}

dd \%data;

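I print the data with Data::Dump; format the output as you need. The output is as expected: the value for the key HHP001X is an arrayref whose elements are arrayrefs ([10, '10 Household'], ...), and so on. (I don't see what OTHER in the expected output is about, and I don't see Survey... in the sample.)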
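Since the question mentions a .csv file as one target, here is a minimal sketch of one way to turn that structure into CSV rows. It assumes the descriptions still carry the quotes from the SAS source, and that `$GROUPC` and `GROUPN` should simply be skipped by name; the `%skip` list, the `csv_field` helper, and the sample `%data` are illustrative and not part of the code above. For anything beyond a quick export, a CSV module from CPAN such as Text::CSV is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical sample in the shape built by the loop above:
# group name => [ [code, description], ... ]
my %data = (
    GROUPN  => [ [ 'LOW - HIGH', '"Range of Values"' ] ],
    HHP015X => [ [ 1, '"1 Northeast"' ], [ 2, '"2 Midwest"' ] ],
);

# Assumption: these two range formats are not real reference data.
my %skip = map { $_ => 1 } ('$GROUPC', 'GROUPN');

# Quote one field for CSV: double any embedded quotes, then wrap in quotes.
sub csv_field {
    my ($v) = @_;
    $v =~ s/"/""/g;
    return qq{"$v"};
}

my @rows;
for my $group (sort keys %data) {
    next if $skip{$group};
    for my $pair (@{ $data{$group} }) {
        my ($code, $desc) = @$pair;       # copies, so %data is not modified
        $desc =~ s/^(["'])(.*)\1$/$2/;    # strip the quotes kept from the SAS source
        push @rows, join ',', map { csv_field($_) } $group, $code, $desc;
    }
}

print "$_\n" for @rows;
```

Each row carries the group name, the code, and the description, so the same rows could feed either one CSV file or one table per VALUE type.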

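I store the data in a hash so that the VALUE names can be used as keys. If you need to preserve their order in the file, then either also record that order (in an array) so the hash keys can be sorted by it, or store the data in an array (of arrayrefs) instead of a hash.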
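The first option (a hash plus an array that records order) could look roughly like the sketch below. The sub name `parse_value_groups` and the in-memory sample are made up for illustration, and two details differ from the loop above as deliberate assumptions: it splits on the first "=" only, and it strips the surrounding quotes from each description.

```perl
use strict;
use warnings;

# Parse "VALUE name ... ;" blocks from a list of lines.
# Returns (\@order, \%data): group names in file order, and
# name => [ [code, description], ... ].
sub parse_value_groups {
    my @lines = @_;
    my (@order, %data, $group);

    for my $line (@lines) {
        if ($line =~ /^\s*VALUE\s+(\S+)/) {
            $group = $1;
            push @order, $group unless exists $data{$group};
            $data{$group} //= [];
            next;
        }
        if ($line =~ /^\s*;\s*$/) {
            undef $group;    # a lone semicolon closes the group
            next;
        }
        next unless defined $group;

        # Split on the first "=" only, so "=" inside a description survives
        my ($code, $desc) = map { s/^\s+|\s+$//gr } split /=/, $line, 2;
        next unless defined $desc;        # skip lines without "="
        $desc =~ s/^(["'])(.*)\1$/$2/;    # drop surrounding quotes
        push @{ $data{$group} }, [ $code, $desc ];
    }
    return (\@order, \%data);
}

my @sample = (
    '   VALUE HHP015X',
    '      1                  = "1 Northeast"',
    '      2                  = "2 Midwest"',
    '   ;',
    '   VALUE HHP009X',
    '      1                  = "1 Refused"',
    '   ;',
);

my ($order, $data) = parse_value_groups(@sample);
for my $name (@$order) {
    print "$name: $_->[0] => $_->[1]\n" for @{ $data->{$name} };
}
```

Iterating over `@$order` instead of `keys %$data` is what keeps the groups in file order; the hash is still there for lookups by name.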

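Answer 1: (score: 1)

The range operator (..) is useful here.

This example just prints the lines in blocks that start with VALUE and end with a semicolon, to give you a starting point: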

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

while (<>) {
  chomp;
  my $match = /^\s*VALUE (\w+)/ .. /^\s*;$/;
  if ($match ne "" && $match == 1) {
    say "Code type: $1";
  } elsif ($match !~ /^$|E0/) {
    say $_; # to-do: print out in your 'code: XX description: YY' format
  }
}

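It uses the return value of the scalar range operator to decide whether the current line is the VALUE line, the terminating semicolon, or something in between: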

  

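The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.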
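To make those return values concrete, here is a small sketch that records what the operator returns for each line of a short in-memory sample. The `@lines` data and the `start`/`body`/`end`/`outside` labels are illustrative only.

```perl
use strict;
use warnings;

my @lines = (
    'PROC FORMAT;',
    '   VALUE HHP015X',
    '      1 = "1 Northeast"',
    '      2 = "2 Midwest"',
    '   ;',
    'DATA NHIS.HOUSEHLD;',
);

my @seen;    # one [sequence value, label] pair per input line
for (@lines) {
    # Scalar-context range operator: empty string outside a block,
    # 1, 2, 3, ... inside it, with "E0" appended on the closing line
    my $seq = /^\s*VALUE\b/ .. /^\s*;\s*$/;

    my $kind = $seq eq ''    ? 'outside'
             : $seq =~ /E0$/ ? 'end'
             : $seq == 1     ? 'start'
             :                 'body';

    push @seen, [ $seq, $kind ];
    printf "%-4s %-8s %s\n", $seq, $kind, $_;
}
```

For this sample the recorded sequence values are the empty string, 1, 2, 3, 4E0, and the empty string again, so the VALUE line, the data lines, and the closing semicolon can each be told apart without a separate state variable.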