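Regex not working for finding a range in a Word document

Date: 2018-08-18 10:55:54

Tags: regex powershell

The regular expressions are not working when I try to extract the content between two sections. The function itself runs fine, but I may not be passing the right regular expressions for the search.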

ExtractFromWordDoc "D:\Scan.doc" '(?:\d{2}\.\d).*(?:Non-Payment)'  '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

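Word document content (the information between 29.1 and 29.2 needs to be extracted):

29.1 Non-payment

An Obligor does not pay on the due date any amount payable pursuant to a Finance Document at the place at and in the currency in which it is expressed to be payable unless:

(a) its failure to pay is caused by: (i) administrative or technical error; or (b) [payment is made within: (i) (in the case of paragraph (a)(i) above), [ ] Business Days of its due date;

29.2 Financial covenants and other obligations

(a) Any requirement of Clause 27 (Financial covenants) is not satisfied [or an Obligor does not comply with the provisions of Clause 26 (Information undertakings)] [and/or Clause 28 (General undertakings)].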

function ExtractFromWordDoc{
Param([string]$SourceFile, [string]$SearchKeyword1, [string]$SearchKeyword2)

$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open($SourceFile,$false,$true)
$sel = $word.Selection 
$paras = $doc.Paragraphs 
foreach ($para in $paras) 
{ 
    if ($para.Range.Text -match $SearchKeyword1)
    {
        $startPosition = $para.Range.Start
       }
    if ($para.Range.Text -match $SearchKeyword2)
    {
        $endPosition = $para.Range.Start
        break
    }
} 

[array]$content=New-Object System.Collections.ArrayList
$doc.Range($startPosition, $endPosition).Copy()
$content=Get-Clipboard -Raw
$content = $content -replace "'", ""

# cleanup com objects
$doc.Close()
$word.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
}

2 answers:

Answer 0 (score: 0)

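You only have one small mistake in your regex.

The sample text says Non-payment, but the regex matches Non-Payment (it is case-sensitive).

If you change '(?:\d{2}\.\d).*(?:Non-Payment)' to '(?:\d{2}\.\d).*(?:Non-payment)', it should work.

Another thing to note is that you are missing the s of obligations in '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)', but I don't think that will cause a problem.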
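Both points can be checked directly in PowerShell. The sketch below is only an illustration, using the two headings from the question as plain strings (the variable names are made up for this example). One caveat it brings out: PowerShell's -match operator ignores case by default, while -cmatch is the case-sensitive variant, so whether the capital P matters depends on which operator (or regex tester) is used.

```powershell
# Sample heading text taken from the Word document in the question.
$heading1 = '29.1 Non-payment'
$heading2 = '29.2 Financial covenants and other obligations'

# The two patterns exactly as passed in the question.
$pattern1 = '(?:\d{2}\.\d).*(?:Non-Payment)'
$pattern2 = '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

# -match ignores case by default, so the capital "P" still matches here.
$insensitive = $heading1 -match $pattern1      # True

# -cmatch is case-sensitive, which is how most online regex testers behave.
$sensitive = $heading1 -cmatch $pattern1       # False

# The pattern is not anchored at the end, so "obligation" matches inside "obligations".
$missingS = $heading2 -cmatch $pattern2        # True

"{0} {1} {2}" -f $insensitive, $sensitive, $missingS
```

If the function still finds nothing with -match, it may be worth printing $para.Range.Text for the heading paragraphs to see what text Word actually returns for them.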

Disclaimer: I did not test your code, only your regex.

Edit:

I tested the following:

function ExtractFromWordDoc{
Param([string]$SourceFile, [string]$SearchKeyword1, [string]$SearchKeyword2)

$word = New-Object -ComObject Word.Application
$word.Visible = $false
$doc = $word.Documents.Open($SourceFile,$false,$true)
$sel = $word.Selection 
$paras = $doc.Paragraphs 
foreach ($para in $paras) 
{ 
    if ($para.Range.Text -match $SearchKeyword1)
    {
        #"Point 1"
        $startPosition = $para.Range.Start
    }
    if ($para.Range.Text -match $SearchKeyword2)
    {
        #"Point 2"
        $endPosition = $para.Range.Start
        break
    }
} 

[array]$content=New-Object System.Collections.ArrayList
$doc.Range($startPosition, $endPosition).Copy()
$content=Get-Clipboard -Raw
$content = $content -replace "'", ""

# cleanup com objects
$doc.Close()
$word.Quit()
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
}

ExtractFromWordDoc "C:\testing\test.doc" '(?:\d{2}\.\d).*(?:Non-payment)'  '(?:\d{2}\.\d).*(?:Financial covenants and other obligation)'

The output on the clipboard was:

29.1 Non-payment An Obligor does not pay on the due date any amount payable pursuant to a Finance Document at the place at and in the currency in which it is expressed to be payable unless: (a) its failure to pay is caused by: (i) administrative or technical error; or (b) [payment is made within: (i) (in the case of paragraph (a)(i) above), [ ] Business Days of its due date;

If the function is changed to output the extracted text at its end (the $content variable), this text is also written to the console.

Answer 1 (score: 0)

To store the text that was found, I suggest using the StringBuilder class.

function ExtractFromWordDoc{
  Param([string]$SourceFile, [string]$SearchKeyword1, [string]$SearchKeyword2)

  $word = New-Object -ComObject Word.Application
  $word.Visible = $false
  $doc = $word.Documents.Open($SourceFile, $false, $true)

  $sb = New-Object System.Text.StringBuilder

  $text = $null

  foreach ($para in $doc.Paragraphs)
  {
    if ($para.Range.Text -match $SearchKeyword1)
    {
      while (1)
      {
        [void]$sb.AppendLine($foreach.current.Range.Text)
        if (-not $foreach.MoveNext())
        {
          break # we ran out of paragraphs
        }
        if ($foreach.current.Range.Text -match $SearchKeyword2)
        {
          $text = $sb.ToString() # the searched text was found
          break
        }
      }
      break
    }
  }

  $text # let's return something useful

  # cleanup com objects
  $doc.Close()
  $save=$false
  $word.Quit([ref]$save)
  [System.Runtime.Interopservices.Marshal]::ReleaseComObject($doc) | Out-Null
  [System.Runtime.Interopservices.Marshal]::ReleaseComObject($word) | Out-Null
  [System.GC]::Collect()
  [System.GC]::WaitForPendingFinalizers()
}

ExtractFromWordDoc 'C:\testing\test.doc' '^\s*\d\d\.\d\s+Non-payment' '^\s*\d\d\.\d\s+Financial covenants and other obligation'
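The collecting loop in the function above relies on the automatic $foreach variable, which exposes the enumerator of the enclosing foreach loop. To try that logic without Word, here is a minimal sketch that runs the same idea over an array of plain strings. Get-TextBetween and the sample $paragraphs array are hypothetical stand-ins invented for this illustration, not part of the original function.

```powershell
# Hypothetical stand-in for $doc.Paragraphs: plain strings instead of Word paragraphs.
$paragraphs = @(
  '28.4 Some earlier clause',
  '29.1 Non-payment',
  'An Obligor does not pay on the due date any amount payable.',
  '29.2 Financial covenants and other obligations',
  '(a) Any requirement of Clause 27 is not satisfied.'
)

# Hypothetical helper: returns the lines from the first match of $StartPattern
# up to (but not including) the next line that matches $EndPattern.
function Get-TextBetween {
  Param([string[]]$Lines, [string]$StartPattern, [string]$EndPattern)

  $sb = New-Object System.Text.StringBuilder
  foreach ($line in $Lines) {
    if ($line -match $StartPattern) {
      while ($true) {
        # $foreach is the enumerator of the enclosing foreach loop.
        [void]$sb.AppendLine($foreach.Current)
        if (-not $foreach.MoveNext()) { return $null }   # ran out of lines
        if ($foreach.Current -match $EndPattern) { return $sb.ToString() }
      }
    }
  }
  return $null   # start pattern never matched
}

Get-TextBetween -Lines $paragraphs `
  -StartPattern '^\s*\d\d\.\d\s+Non-payment' `
  -EndPattern '^\s*\d\d\.\d\s+Financial covenants and other obligation'
```

As in the function above, nothing is returned when the end heading is never found, so a partial match is not mistaken for the full section.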