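Using a regular expression to extract columnar data

Time: 2016-09-30 11:29:23

Tags: php regex perl

How can I use a regular expression to read columnar data? I want to store this data in a database, so how do I separate the columns?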

 S/NO         INSULATED TANK SIZE          QTY        U.PRICE(Qr.)     TOTAL PRICE (Qr.)
        FW-50(S) (5 x 5 x 2 MH)
 01                                        1 SET        131,592.00            131,592.00
               w/p(3+2)
        FW-120(S) (10 x 6 x 2 MH) w/p
 02                                        1 SET        252,330.00            252,330.00
                  (5+5)
 03      FW-2(S) (1 x 2 x 1 MH) w/p (1+1)  1 SET        14,471.00             14,471.00

I have already converted the PDF to a text file with a Linux command and want to read the data column by column. How should I go about it?

1 answer:

Answer 0: (score: 0)

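Parsing OCR text output is an exercise in frustration and corner cases. You need to actually implement a parser that handles the different kinds of data you may encounter. There is no good way to know that your parser is correct, because more edge cases can turn up in the future, which makes the solution brittle.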

With that caveat, here is one way you could go about it:

#!/usr/bin/env perl

use warnings;
use strict;

use Data::Dumper;
$Data::Dumper::Sortkeys = 1;

my @fields;
my @extra_descriptions;
my @results;
open my $fh, '<', 'input' or die "Unable to open 'input' : $!";
while( <$fh> ) {
    my @data;
    chomp();  # Remove newline
    s|^\s+||; # Remove leading spaces
    s|\s+$||; # Remove trailing spaces

    next unless m|\w|; # Skip empty lines

    @data = split/\s\s+/;  # Split on 2 or more spaces

    # Parse Header
    if ($. == 1) {
        @fields = @data;
        next;
    }

    if (1 == scalar @data) {
        # Extra Size Description
        push @extra_descriptions, shift @data;
        next;
    } elsif ( 4 == scalar @data or 5 == scalar @data ) {            
        my $sn = shift @data;
        my $desc = '';

        # Deal with possibly missing Size info
        if ( 4 == scalar @data ) {
            my $size = shift @data;
            $desc = join(', ', $size, @extra_descriptions);
        } else {
            # 3 columns, so missing Size info

            # Reverse because now main description is last
            $desc = join(', ', reverse @extra_descriptions); 
        }

        unshift(@data, $sn, $desc);

        # Data should be 5 columns
        (5 == scalar @data) or die "Something went wrong with data: " . join("\n",@data);

        # Size (description) should be column 1 ( second column )
        $data[1] =~ m|[FWx]| or die "Could not figure out size! $data[1]";

        my %row;
        my @field_names = qw( serialno size quantity unit_price total_price );
        for my $i ( 0 .. $#field_names ) {
            my $name = $field_names[$i];
            my $desc = $name . "_desc";
            $row{$name} = $data[$i];
            $row{$desc} = $fields[$i];
        }

        # TODO: Insert data into database here
        print Dumper(\%row);

        # Reset
        undef @extra_descriptions;

    } else {
        # Not 1, 4 or 5 columns
        die "Do not know what to do about this row: '$_'";
    }

}
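
The TODO in the script leaves the database step open. As a minimal sketch of what could come first: the parsed values are still strings such as '1 SET' and '131,592.00', so they would probably need cleaning before they fit numeric columns. The helper name `normalize_row`, the extra `quantity_unit` key, and the "count times unit price equals total" check are assumptions made for illustration, not part of the answer's script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: convert the string fields of one parsed row
# into values suited to numeric database columns.
sub normalize_row {
    my (%row) = @_;

    # "1 SET" -> count 1, unit "SET"
    my ($count, $unit) = $row{quantity} =~ /^(\d+)\s*(\S.*)?$/
        or die "Unrecognised quantity: '$row{quantity}'";

    # "131,592.00" -> "131592.00" (drop thousands separators)
    for my $key (qw( unit_price total_price )) {
        (my $n = $row{$key}) =~ tr/,//d;
        $n =~ /^\d+(?:\.\d+)?$/ or die "Unrecognised $key: '$row{$key}'";
        $row{$key} = $n;
    }

    # Assumed sanity check: count * unit price should match the total.
    abs($count * $row{unit_price} - $row{total_price}) < 0.005
        or die "Totals do not add up for row $row{serialno}";

    @row{qw( quantity quantity_unit )} = ($count, $unit // '');
    return %row;
}

# Sample values taken from the first row of the question's data.
my %clean = normalize_row(
    serialno    => '01',
    size        => 'FW-50(S) (5 x 5 x 2 MH)',
    quantity    => '1 SET',
    unit_price  => '131,592.00',
    total_price => '131,592.00',
);
print "$clean{quantity} $clean{quantity_unit}, ",
      "unit $clean{unit_price}, total $clean{total_price}\n";
```

A row that fails any of these checks dies with a message instead of reaching the database, which is one way to notice a misparsed line early. The cleaned values could then be bound to placeholders in a prepared statement, for example with DBI.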

Output of the parser for the sample input:

$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '01',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-50(S) (5 x 5 x 2 MH)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '131,592.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '131,592.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '02',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-120(S) (10 x 6 x 2 MH) w/p, w/p(3+2)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '252,330.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '252,330.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
$VAR1 = {
          'quantity' => '1 SET',
          'quantity_desc' => 'QTY',
          'serialno' => '03',
          'serialno_desc' => 'S/NO',
          'size' => 'FW-2(S) (1 x 2 x 1 MH) w/p (1+1), (5+5)',
          'size_desc' => 'INSULATED TANK SIZE',
          'total_price' => '14,471.00',
          'total_price_desc' => 'TOTAL PRICE (Qr.)',
          'unit_price' => '14,471.00',
          'unit_price_desc' => 'U.PRICE(Qr.)'
        };
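
In the output above, the fragment 'w/p(3+2)' lands on row 02 and '(5+5)' on row 03, although in the source layout they appear to be the wrapped tail of the descriptions for rows 01 and 02. The following is a sketch of one possible adjustment, not a definitive fix. It assumes that every item description starts with "FW-" and that any other single-column line continues the row printed just before it; both rules are guesses based only on the three sample rows.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sample rows copied from the question (header line omitted).
my $text = <<'END';
        FW-50(S) (5 x 5 x 2 MH)
 01                                        1 SET        131,592.00            131,592.00
               w/p(3+2)
        FW-120(S) (10 x 6 x 2 MH) w/p
 02                                        1 SET        252,330.00            252,330.00
                  (5+5)
 03      FW-2(S) (1 x 2 x 1 MH) w/p (1+1)  1 SET        14,471.00             14,471.00
END

my @records;   # completed rows, in input order
my @pending;   # description fragments seen before their row

for my $line (split /\n/, $text) {
    $line =~ s/^\s+|\s+$//g;            # trim both ends
    next unless $line =~ /\S/;          # skip blank lines
    my @cols = split /\s{2,}/, $line;   # columns are 2+ spaces apart

    if (@cols == 1) {
        # Assumption: "FW-" opens a new item; anything else is the
        # wrapped tail of the most recently completed row.
        if ($cols[0] =~ /^FW-/ or !@records) {
            push @pending, $cols[0];
        } else {
            $records[-1]{size} .= " $cols[0]";
        }
    } elsif (@cols == 4 or @cols == 5) {
        my $sn   = shift @cols;
        # Four columns left means the size was on the row itself.
        my $size = @cols == 4 ? shift @cols : join ' ', @pending;
        @pending = ();
        my %row;
        @row{qw( serialno size quantity unit_price total_price )}
            = ($sn, $size, @cols);
        push @records, \%row;
    } else {
        die "Do not know what to do about this row: '$line'";
    }
}

print "$_->{serialno}: $_->{size}\n" for @records;
```

Because a trailing fragment can only be attached after its row has been read, this version collects the rows in `@records` and prints them at the end, rather than emitting each row as soon as it is parsed.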