假设我有一个制表符分隔的文本文件,其中包含按列排列的数据(带标题)。
有可能将不同的列“堆叠”成类似“工作表”的排列,即有一些分隔符(可能提前或可能不知道)允许不同的列垂直排列。 / p>
是否有一个Perl模块可以帮助将此文本文件中的列数据解析为数据结构(例如,键是列标题的哈希表,值是列数据标量的数组)?
编辑通过“堆叠”,我的意思是一列文本可能包含多个单独的数据“向量”,每个数据都有不同的标题和不同的长度。不可否认,这使解析变得复杂。
编辑我真的不确定混淆在哪里。尽管如此,这是一个例子:
header_one\theader_three
data_1\tdata_7
data_2\tdata_8
data_3\tdata_9
\tdata_10
header_two\tdata_11
data_4\theader_four
data_5\tdata_12
data_6\tdata_13
\tdata_14
该脚本会将其转换为包含四个密钥的哈希表:header_one
,header_two
,header_three
和header_four
,每个键引用一个指向标题下方有data_n
个元素。
答案 0 :(得分:2)
如果可能,我会从DBD::CSV开始,虽然您的“堆叠”要求(我不完全理解)可能需要使用Text::CSV_XS进行一些手动解析。
不要被他们的名字所迷惑 - 他们可以解析任何分隔符,而不仅仅是逗号。
答案 1 :(得分:2)
我认为这与你所说的很接近。如果列数发生变化,则输入将被视为不同的表。可以轻松修改此代码以识别其他标记(例如一行等号),而不是使用列计数。
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#setup the parser, here we want tab separated and we allow
#loose quoting, so qq/foo\t"bar\tbaz"\tquux/ is
#("foo", "bar\tbaz", "quux")
my $p = Text::CSV_XS->new(
{
sep_char => "\t",
allow_loose_quotes => 1,
}
);
my @stacked;
my $cur = 0;
while (<>) {
$p->parse($_) or die $p->error_input;
my @rec = $p->fields;
#normal case, just add the record to the last
#section in @stacked
if (@rec == $cur) {
push @{$stacked[-1]}, \@rec;
next;
}
#if the number of columns don't match then
#we have a new section
push @stacked, [\@rec];
$cur = @rec; #set the new number of columns
}
for my $table (@stacked) {
print "header: ", join("::", @{$table->[0]}), "\n";
for my $i (1 .. $#$table) {
print "data: ", join("::", @{$table->[$i]}), "\n";
}
print "\n";
}
答案 2 :(得分:-1)
不是很顺利,但我一直这样做:
my $recordType = unpack("A3", $_);
if ($recordType eq "APT")
{
$currentKey = parseFAAAirportAirportRecord($_);
}
elsif ($recordType eq "ATT")
{
parseFAAAirportAttendenceRecord($currentKey, $_);
}
elsif ($recordType eq "RWY")
{
parseFAAAirportRunwayRecord($currentKey, $_);
}
elsif ($recordType eq "RMK")
{
parseFAAAirportRemarkRecord($currentKey, $_);
}
...
sub parseFAAAirportAirportRecord($)
{
my ($line) = @_;
my ($recordType, $datasource_key, $type, $id, $effDate, $faaRegion,
$faaFieldOffice, $state, $stateName, $county, $countyState,
$city, $name, $ownershipType, $facilityUse, $ownersName,
$ownersAddress, $ownersCityStateZip, $ownersPhone, $facilitiesManager,
$managersAddress, $managersCityStateZip, $managersPhone,
$formattedLat, $secondsLat, $formattedLong, $secondsLong,
$refDetermined, $elev, $elevDetermined, $magVar, $magVarEpoch, $tph,
$sectional, $distFromTown, $dirFromTown, $acres,
$bndryARTCC, $bndryARTCCid,
$bndryARTCCname, $respARTCC, $respARTCCid, $respARTCCname,
$fssOnAirport, $fssId, $fssName, $fssPhone, $fssTollFreePhone,
$altFss, $altFssName,
$altFssPhone, $notamFacility, $notamD, $arptActDate,
$arptStatusCode, $arptCert,
$naspAgreementCode, $arptAirspcAnalysed, $aoe, $custLandRights,
$militaryJoint, $militaryRights, $nationalEmergency, $milUse,
$inspMeth, $inspAgency, $lastInsp, $lastInfo, $fuel, $airframeRepairs
,
$engineRepairs, $bottledOyxgen, $bulkOxygen,
$lightingSchedule, $tower, $unicomFreqs, $ctafFreq, $segmentedCircle,
$lens, $landingFee, $isMedical,
$numBasedSEL, $numBasedMEL, $numBasedJet,
$numBasedHelo, $numBasedGliders, $numBasedMilitary,
$numBasedUltraLight,
$numScheduledOperation, $numCommuter, $numAirTaxi,
$numGAlocal, $numGAItinerant,
$numMil, $countEndingDate,
$aptPosSrc, $aptPosSrcDate, $aptElevSrc, $aptElevSrcDate,
$contractFuel, $transientStorage, $otherServices, $windIndicator,
$icaoId) =
unpack("A3 A11 A13 A4 A10 A3 A4 A2 A20 A21 A2 A40 " .
"A42 A2 A2 A35 A72 A45 A16 A35 A72 A45 A16 A15 A12 A15 A12 A1 A5 A1 " .
"A3 A4 A4 A30 A2 A3 A5 A4 A3 A30 A4 A3 A30 A1 A4 A30 A16 A16 " .
"A4 A30 A16 A4 " .
"A1 A7 A2 A15 A7 A13 A1 A1 A1 A1 A18 A6 A2 A1 A8 A8 A40 A5 A5 A8 " .
"A8 A9 A1 A42 A7 A4 A3 A1 A1 A3 A3 A3 A3 A3 A3 A3 " .
"A6 A6 A6 A6 A6 A6 A10" .
"A16 A10 A16 A10 A1 A12 A71 A3 A7", $line);