是否有用于解析柱状文本的Perl模块?

时间:2009-04-21 22:10:27

标签: perl parsing module

假设我有一个制表符分隔的文本文件,其中包含按列排列的数据(带标题)。

有可能将不同的列“堆叠”成类似“工作表”的排列,即有一些分隔符(可能提前或可能不知道)允许不同的列垂直排列。 / p>

是否有一个Perl模块可以帮助将此文本文件中的列数据解析为数据结构(例如,键是列标题的哈希表,值是列数据标量的数组)?

编辑通过“堆叠”,我的意思是一列文本可能包含多个单独的数据“向量”,每个数据都有不同的标题和不同的长度。不可否认,这使解析变得复杂。

编辑我真的不确定混淆在哪里。尽管如此,这是一个例子:

header_one\theader_three
data_1\tdata_7
data_2\tdata_8
data_3\tdata_9
\tdata_10
header_two\tdata_11
data_4\theader_four
data_5\tdata_12
data_6\tdata_13
\tdata_14

该脚本会将其转换为包含四个密钥的哈希表:header_oneheader_twoheader_threeheader_four,每个键引用一个指向标题下方有data_n个元素。

3 个答案:

答案 0 :(得分:2)

如果可能,我会从DBD::CSV开始,虽然您的“堆叠”要求(我不完全理解)可能需要使用Text::CSV_XS进行一些手动解析。

不要被他们的名字所迷惑 - 他们可以解析任何分隔符,而不仅仅是逗号。

答案 1 :(得分:2)

我认为这与你所说的很接近。如果列数发生变化,则输入将被视为不同的表。可以轻松修改此代码以识别其他标记(例如一行等号),而不是使用列计数。

#!/usr/bin/perl

use strict;
use warnings;

use Text::CSV_XS;

#setup the parser, here we want tab separated and we allow
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is 
#("foo", "bar\tbaz", "quux")
my $p = Text::CSV_XS->new(
    {
        sep_char           => "\t",
        allow_loose_quotes => 1,
    }
);

my @stacked;
my $cur = 0;
while (<>) {
    $p->parse($_) or die $p->error_input;
    my @rec = $p->fields;
    #normal case, just add the record to the last
    #section in @stacked
    if (@rec == $cur) {
        push @{$stacked[-1]}, \@rec;
        next;
    }
    #if the number of columns don't match then
    #we have a new section
    push @stacked, [\@rec];
    $cur = @rec; #set the new number of columns
}

for my $table (@stacked) {
    print "header: ", join("::", @{$table->[0]}), "\n";
    for my $i (1 .. $#$table) {
        print "data: ", join("::", @{$table->[$i]}), "\n";
    }
    print "\n";
}

答案 2 :(得分:-1)

不是很顺利,但我一直这样做:

        my $recordType = unpack("A3", $_);

        if ($recordType eq "APT")
        {
            $currentKey = parseFAAAirportAirportRecord($_);
        }
        elsif ($recordType eq "ATT")
        {
            parseFAAAirportAttendenceRecord($currentKey, $_);
        }
        elsif ($recordType eq "RWY")
        {
            parseFAAAirportRunwayRecord($currentKey, $_);
        }
        elsif ($recordType eq "RMK")
        {
            parseFAAAirportRemarkRecord($currentKey, $_);
        }
...
sub parseFAAAirportAirportRecord($)
{
    my ($line) = @_;

    my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion,
        $faaFieldOffice, $state, $stateName, $county, $countyState,
        $city, $name, $ownershipType, $facilityUse, $ownersName,
        $ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager,
        $managersAddress, $managersCityStateZip, $managersPhone,
        $formattedLat, $secondsLat, $formattedLong, $secondsLong,
        $refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph,
        $sectional, $distFromTown, $dirFromTown, $acres,
        $bndryARTCC, $bndryARTCCid,
        $bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname,
        $fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone,
        $altFss, $altFssName,
        $altFssPhone, $notamFacility, $notamD, $arptActDate,
        $arptStatusCode, $arptCert,
        $naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights,
        $militaryJoint, $militaryRights, $nationalEmergency, $milUse,
        $inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs
,
        $engineRepairs, $bottledOyxgen, $bulkOxygen,
        $lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle,
        $lens, $landingFee, $isMedical,
        $numBasedSEL, $numBasedMEL, $numBasedJet,
        $numBasedHelo, $numBasedGliders, $numBasedMilitary,
        $numBasedUltraLight,
        $numScheduledOperation, $numCommuter, $numAirTaxi,
        $numGAlocal, $numGAItinerant,
        $numMil, $countEndingDate,
        $aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate,
        $contractFuel, $transientStorage, $otherServices, $windIndicator,
        $icaoId) =
        unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " .
        "A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " .
        "A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " .
        "A4 A30 A16 A4 " .
        "A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " .
        "A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " .
        "A6 A6 A6 A6 A6 A6 A10" .
        "A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);