How do I use regular expressions to read columnar data? I want to store this data in a database, so how do I separate the columns?
S/NO   INSULATED TANK SIZE                 QTY     U.PRICE(Qr.)   TOTAL PRICE (Qr.)
       FW-50(S) (5 x 5 x 2 MH)
01                                         1 SET   131,592.00     131,592.00
       w/p(3+2)
       FW-120(S) (10 x 6 x 2 MH) w/p
02                                         1 SET   252,330.00     252,330.00
       (5+5)
03     FW-2(S) (1 x 2 x 1 MH) w/p (1+1)    1 SET   14,471.00      14,471.00
I have already converted the PDF to a text file using a Linux command. How should I now read the data column by column?
Answer 0 (score: 0)
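Parsing OCR text output is an exercise in frustration and corner cases. You need to actually implement a parser that handles the different kinds of data you may encounter. There is no good way to know that your parser is correct, because more edge cases may turn up in the future, which makes the solution brittle and possibly buggy.
With that caveat, here is one way you could approach it: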
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;

my @fields;              # Column headings taken from the header line
my @extra_descriptions;  # Size text found on lines of its own
my @results;

# Three-argument open keeps the mode separate from the file name
open my $fh, '<', 'input' or die "Unable to open 'input' : $!";
while ( <$fh> ) {
    my @data;
    chomp();             # Remove newline
    s|^\s+||;            # Remove leading spaces
    s|\s+$||;            # Remove trailing spaces
    next unless m|\w|;   # Skip empty lines
    @data = split /\s\s+/;  # Split on 2 or more spaces

    # Parse Header (assumes the header is the first line of the file)
    if ($. == 1) {
        @fields = @data;
        next;
    }

    if (1 == scalar @data) {
        # Extra Size Description: keep it for the next data row
        push @extra_descriptions, shift @data;
        next;
    } elsif ( 4 == scalar @data or 5 == scalar @data ) {
        my $sn   = shift @data;
        my $desc = '';

        # Deal with possibly missing Size info
        if ( 4 == scalar @data ) {
            my $size = shift @data;
            $desc = join(', ', $size, @extra_descriptions);
        } else {
            # 3 columns, so missing Size info
            # Reverse because now main description is last
            $desc = join(', ', reverse @extra_descriptions);
        }
        unshift(@data, $sn, $desc);

        # Data should be 5 columns
        (5 == scalar @data) or die "Something went wrong with data: " . join("\n", @data);

        # Size (description) should be column 1 ( second column )
        $data[1] =~ m|[FWx]| or die "Could not figure out size! $data[1]";

        # Store each value under its field name, and the original
        # column heading under "<name>_desc"
        my %row;
        my @field_names = qw( serialno size quantity unit_price total_price );
        for my $i ( 0 .. $#field_names ) {
            my $name = $field_names[$i];
            my $desc = $name . "_desc";
            $row{$name} = $data[$i];
            $row{$desc} = $fields[$i];
        }

        # TODO: Insert data into database here
        print Dumper(\%row);

        # Reset
        @extra_descriptions = ();
    } else {
        # Not 1, 4 or 5 columns
        die "Do not know what to do about this row: '$_'";
    }
}
close $fh;
Output
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '01',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-50(S) (5 x 5 x 2 MH)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '131,592.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '131,592.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '02',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-120(S) (10 x 6 x 2 MH) w/p, w/p(3+2)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '252,330.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '252,330.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '03',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-2(S) (1 x 2 x 1 MH) w/p (1+1), (5+5)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '14,471.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '14,471.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
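Every value in the dump is still a string ('131,592.00', '1 SET'), so it will probably need cleaning before it goes into numeric database columns. The following is only a rough sketch of that step, using core Perl alone. The helper name `normalize_row` and the extra `quantity_unit` key are made up for illustration and are not part of the script above, and it assumes prices always use a comma as thousands separator and quantities always look like "<number> <unit>".

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): turn the string
# fields of one parsed row into values suited to typed database columns.
sub normalize_row {
    my ($row) = @_;

    # '1 SET' -> count 1, unit 'SET'
    my ($count, $unit) = $row->{quantity} =~ m/^(\d+)\s+(\S+)$/
        or die "Unexpected quantity: '$row->{quantity}'";

    my %clean = (
        serialno      => $row->{serialno},   # kept as text, e.g. '01'
        size          => $row->{size},
        quantity      => $count + 0,
        quantity_unit => $unit,
    );

    # '131,592.00' -> 131592 : drop thousands separators, then numify
    for my $key (qw( unit_price total_price )) {
        (my $number = $row->{$key}) =~ tr/,//d;
        $number =~ m/^\d+(?:\.\d+)?$/
            or die "Unexpected price in $key: '$row->{$key}'";
        $clean{$key} = $number + 0;
    }

    # Sanity check: quantity x unit price should equal the total
    my $expected = $clean{quantity} * $clean{unit_price};
    abs($expected - $clean{total_price}) < 0.005
        or die "Total mismatch for row $row->{serialno}";

    return \%clean;
}

# Example with the first row from the output above
my $clean = normalize_row({
    serialno    => '01',
    size        => 'FW-50(S) (5 x 5 x 2 MH)',
    quantity    => '1 SET',
    unit_price  => '131,592.00',
    total_price => '131,592.00',
});
printf "%s: %d %s at %.2f = %.2f\n",
    @{$clean}{qw( serialno quantity quantity_unit unit_price total_price )};
# prints "01: 1 SET at 131592.00 = 131592.00"
```

The cleaned values could then be bound to `?` placeholders in a prepared INSERT statement, for example through the DBI module's `prepare` and `execute`, rather than being interpolated into the SQL string. The total check is one cheap way to notice rows where the column split went wrong.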