Using PowerShell, RegEx, and itextsharp.dll to find specific fields in a PDF

Asked: 2015-11-27 19:17:45

Tags: regex powershell pdf itextsharp

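I'm a novice when it comes to RegEx, but I've spent the past few hours trying to figure out how to parse some data from a PDF using PowerShell and itextsharp.dll. I was going to post on the itextsharp forums, but I couldn't find a place there to ask for help; it's just a bunch of how-tos for people who already know RegEx.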

The PDF form contains a table (the original screenshot is not included here).

The text that itextsharp.dll extracts looks like this:

Selection Criteria Report parameters
Select all Bottles where
Date Loaded - Date/Time (Bottle) is after or equal to '11/20/2015 15:50'
AND
Date Loaded - Date/Time (Bottle) is before or equal to '11/20/2015
16:10'
N/A
Unit # Status Determined Bottle ID Time to Find Cell
=W00000000000001 Negative 11/25/2015 16:08 AAAACNSJ 5 2D55
=W00000000000002 Negative 11/25/2015 16:08 AAAACNSA 5 2D56
1291231 Negative 11/25/2015 16:08 AAAACNB 5 2D57
=W00000000000003 Positive 11/25/2015 16:08 AAAACNS9 5 2D58
1981231 Negative 11/25/2015 16:09 AAAACNSG 5 2D59
=W00000000000004 Negative 11/25/2015 16:10 AAAACNS7 5 2D60
Report
Reviewed By: Printed for manual signature
Page 1 of 1 11/25/2015 16:15

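I've been using the following code with a variety of RegEx expressions to try to parse only the table data and set each column to a variable. I've left out all the different things I tried, because there were too many of them and, given the way the data is laid out, I really don't know what I'm doing.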

 for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{

    $strategy = new-object  'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'            
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
    [string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default  , [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));    
    $Line = $text -Split "`n"
    $i = 0
    Do {    
        If ($Line[$i] -match '(?m)^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$') {
            Write-Host $Line[$i]
        }
        $i = $i + 1
    }
    While ($Line[$i])
}
$Reader.Close();

Is there anyone out there who can help me set all of these columns to variables? Any help would be greatly appreciated. Thanks!

1 Answer:

Answer 0 (score: 1)

Here is a sample regex that should correctly parse a single-line string:

$text = '=W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56'
$text -match '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>[\/\d\s:]+)\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$'
$matches

Output:

Name                           Value
----                           -----
determined                     11/25/2015 16:08
cell                           2D56
status                         Negative
bottle_id                      PAGYCNQ6
time_to_find                   5
unit_id                        =W03651532551000
0                              =W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56
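
To get each column into its own variable, as the question asks, the named groups can be read from the automatic `$Matches` hashtable after a successful `-match`. The following is a minimal sketch rather than a definitive solution: the variable names (`$unitId`, `$determinedAt`, and so on) are hypothetical, and the date format `MM/dd/yyyy HH:mm` is an assumption based on the sample rows.

```powershell
$line = '=W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56'
$pattern = '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$'

if ($line -match $pattern) {
    # Each named group becomes a key in the automatic $Matches hashtable.
    $unitId     = $Matches['unit_id']
    $status     = $Matches['status']
    $bottleId   = $Matches['bottle_id']
    $cell       = $Matches['cell']
    $timeToFind = [int]$Matches['time_to_find']

    # Collapse any run of whitespace between date and time to one space,
    # then parse using the assumed format MM/dd/yyyy HH:mm.
    $determinedText = $Matches['determined'] -replace '\s+', ' '
    $determinedAt   = [datetime]::ParseExact(
        $determinedText,
        'MM/dd/yyyy HH:mm',
        [System.Globalization.CultureInfo]::InvariantCulture)
}
```

Converting `time_to_find` to `[int]` and `determined` to `[datetime]` is optional; leaving them as strings works as well if the values are only displayed.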

Here is a more complex one, which builds an object for each matching line:

$objcol = @()
$text = "=W03651532551000 Negative 11/25/2015 16:08 PAGYCNQ6 5 2D56`nLW03651532551000 Positive 11/25/2015 16:08 PAGYCNQ6 5 2D56"
$res = $text.Split("`n") | where {
 $_ -match '(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)' 
} | foreach {
   $obj = new-object PSObject -prop @{ 
    unitId=$matches['unit_id']; status=$matches['status']; 
    Determined=$matches['determined']; bottleId=$matches['bottle_id']; 
    timeToFind=$matches['time_to_find'] 
  }
  $objcol += $obj
 }
Write-Output $objcol

Result:

bottleId   : PAGYCNQ6
timeToFind : 5
Determined : 11/25/2015 16:08
unitId     : =W03651532551000
status     : Negative

bottleId   : PAGYCNQ6
timeToFind : 5
Determined : 11/25/2015 16:08
unitId     : LW03651532551000
status     : Positive
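
The same idea can be applied to the whole extracted page text. The sketch below is one possible variation, not the answer's exact method: it emits objects straight from the loop instead of appending to an array, it also keeps the `cell` column, and it anchors the pattern with `^` and `$` so that heading and footer lines are skipped. The function name `ConvertFrom-BottleReport` is hypothetical, and `[pscustomobject]` assumes PowerShell 3.0 or later.

```powershell
function ConvertFrom-BottleReport {
    param([Parameter(Mandatory)][string]$Text)

    # Anchored so that only complete table rows match.
    $pattern = '^(?<unit_id>=?\w+)\s+(?<status>\w+)\s+(?<determined>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2})\s+(?<bottle_id>\w+)\s+(?<time_to_find>\d+)\s+(?<cell>\w+)$'

    foreach ($line in ($Text -split '\r?\n')) {
        if ($line.Trim() -match $pattern) {
            # One object per table row; non-matching lines produce nothing.
            [pscustomobject]@{
                UnitId     = $Matches['unit_id']
                Status     = $Matches['status']
                Determined = $Matches['determined']
                BottleId   = $Matches['bottle_id']
                TimeToFind = [int]$Matches['time_to_find']
                Cell       = $Matches['cell']
            }
        }
    }
}

# Sample lines taken from the extracted text shown in the question.
$sample = @'
Unit # Status Determined Bottle ID Time to Find Cell
=W00000000000001 Negative 11/25/2015 16:08 AAAACNSJ 5 2D55
1291231 Negative 11/25/2015 16:08 AAAACNB 5 2D57
=W00000000000003 Positive 11/25/2015 16:08 AAAACNS9 5 2D58
Page 1 of 1 11/25/2015 16:15
'@

$rows = @(ConvertFrom-BottleReport -Text $sample)
$rows | Format-Table -AutoSize
```

In the question's loop, the string returned by `GetTextFromPage` for each page could be passed as `-Text`, and individual columns are then available as properties, for example `$rows[0].BottleId`.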