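Improving performance when searching for a string in multiple Word files

Time: 2017-04-07 10:47:50

Tags: powershell string-search

I have drafted a PowerShell script that searches for a string across a large number of Word files. The script works, but I have about 1 GB of data to search and a run takes roughly 15 minutes.

Can anyone suggest changes that would make it run faster?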

Set-StrictMode -Version latest
$path     = "c:\Tester1"
$output   = "c:\Scripts\ResultMatch1.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "Roaming"
$charactersAround = 30
$results = @()

Function getStringMatch
{
    # Escape the search term so regex metacharacters in it are treated literally
    $pattern = ".{$charactersAround}$([regex]::Escape($findtext)).{$charactersAround}"

    For ($i = 1; $i -le 4; $i++) {
        $finalpath = Join-Path $path "D$i"
        $files = Get-ChildItem $finalpath -Include *.docx,*.doc -Recurse | Where-Object { !($_.PSIsContainer) }

        # Loop through all *.doc and *.docx files under $finalpath
        Foreach ($file In $files)
        {
            # Open read-only, without the conversion prompt
            $document = $application.Documents.Open($file.FullName, $false, $true)
            $range = $document.Content

            If ($range.Text -match $pattern) {
                $properties = @{
                    File       = $file.FullName
                    Match      = $findtext
                    TextAround = $Matches[0]
                }
                $results += New-Object -TypeName PsCustomObject -Property $properties
            }

            # Close every document, not only the ones that matched
            $document.Close()
        }
    }

    If ($results) {
        $results | Export-Csv $output -NoTypeInformation
    }

    $application.Quit()
}

getStringMatch

import-csv $output

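1 Answer:

Answer 0 (score: 0)

As mentioned in the comments, you may want to consider using the OpenXML SDK library (you can also get the latest version of the SDK on GitHub), since it has less overhead than starting an instance of Word.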
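
To illustrate why no Word instance is needed: a .docx file is a ZIP package whose main text is stored in the part word/document.xml. The following is a minimal sketch, not the SDK's API, that reads that part with the .NET System.IO.Compression classes and returns the visible text. Get-DocxText is a hypothetical helper name, and the sketch assumes the text of interest sits in w:t elements of the main part only (headers, footers and footnotes live in other parts and are ignored here).

```powershell
# Sketch only: read the text of a .docx without Word and without the SDK.
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

function Get-DocxText {   # hypothetical helper, not part of any library
    param([Parameter(Mandatory)][string]$Path)

    # A .docx is a ZIP archive; pass a full path, since .NET resolves
    # relative paths against the process directory, not the PS location.
    $zip = [System.IO.Compression.ZipFile]::OpenRead($Path)
    try {
        $entry = $zip.GetEntry('word/document.xml')
        if (-not $entry) { return '' }

        $xml = New-Object System.Xml.XmlDocument
        $stream = $entry.Open()
        try     { $xml.Load($stream) }
        finally { $stream.Dispose() }
    }
    finally { $zip.Dispose() }

    $ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
    $ns.AddNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main')

    # Join the text runs of each paragraph, then join paragraphs with newlines.
    $paragraphs = foreach ($p in $xml.SelectNodes('//w:p', $ns)) {
        ($p.SelectNodes('.//w:t', $ns) | ForEach-Object { $_.InnerText }) -join ''
    }
    $paragraphs -join "`n"
}
```

Calling Get-DocxText -Path with the full path of a document returns one string that can be fed to -match. Because the runs of a paragraph are joined before matching, a search term that Word happened to split across several runs can still be found, which is not the case when matching against the raw XML.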

Below I have converted your current function into a more generic one that uses the SDK and does not depend on the caller/parent scope:

function Get-WordStringMatch
{
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [System.IO.FileInfo[]]$Files,
        [string]$FindText,
        [int]$CharactersAround
    )

    begin {
        # import the OpenXML library
        Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll' |Out-Null

        # make a "shorthand" reference to the word document type
        $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

        # construct the regex pattern
        $Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
    }

    process {
        # loop through all the files; the SDK reads Open XML packages (.docx) only,
        # so legacy binary .doc files are skipped with a warning
        foreach ($File In $Files)
        {
            if ($File.Extension -ne '.docx') {
                Write-Warning "Skipping non-.docx file: $($File.FullName)"
                continue
            }

            $Document = $DocumentStream = $DocumentReader = $null
            try {
                # open document read-only, wrap content stream in streamreader
                $Document       = $WordDoc::Open($File.FullName, $false)
                $DocumentStream = $Document.MainDocumentPart.GetStream()
                $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

                # read entire main document part (raw XML, markup included)
                if($DocumentReader.ReadToEnd() -match $Pattern)
                {
                    # got a match? output our custom object
                    New-Object psobject -Property @{
                        File = $File.FullName
                        Match = $FindText
                        TextAround = $Matches[0]
                    }
                }
            }
            finally {
                # clean up after every file, not just the last one
                if ($DocumentReader) { $DocumentReader.Dispose() }
                if ($DocumentStream) { $DocumentStream.Dispose() }
                if ($Document)       { $Document.Dispose() }
            }
        }
    }
}
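
One thing to be aware of with the pattern built in the function: .{N} demands exactly N characters on each side of the search term, and . does not match a line feed by default, so a hit within the first or last N characters of the text (or next to a line break) is not reported. The following is a small sketch of a more lenient variant, assuming that is the behaviour you want. Get-ContextMatch is a hypothetical helper name; it uses .{0,N} together with the Singleline regex option and returns only the first match.

```powershell
# Sketch only: context match that tolerates hits near the edges of the text.
function Get-ContextMatch {   # hypothetical helper name
    param(
        [Parameter(Mandatory)][string]$Text,
        [Parameter(Mandatory)][string]$FindText,
        [int]$CharactersAround = 30
    )

    # Escape the search term so characters such as '.' or '(' are literal,
    # and allow 0..N context characters instead of exactly N.
    $pattern = ".{0,$CharactersAround}$([regex]::Escape($FindText)).{0,$CharactersAround}"

    # Singleline lets '.' match line feeds as well.
    $options = [System.Text.RegularExpressions.RegexOptions]::Singleline
    $m = [regex]::Match($Text, $pattern, $options)

    # Emit the matched text with its context, or nothing when there is no hit.
    if ($m.Success) { $m.Value }
}
```

The trade-off is that TextAround can then be shorter than 2 x N characters plus the search term, which may or may not matter for the CSV you produce.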

Now that you have a nice function that supports pipeline input, all you need to do is gather your files and pipe them to it!

# variables
$path     = "c:\Tester1"
$output   = "c:\Scripts\ResultMatch1.csv"
$findtext = "Roaming"
$charactersAround = 30

# gather the files ($_ is the current number from 1..4)
$files = 1..4 |ForEach-Object {
    $finalpath = Join-Path $path "D$_"
    Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and @('.docx','.doc') -contains $_.Extension }
}

# run them through our new function
$results = $files |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround

# got any results? export it all to CSV
if($results){
    $results |Export-Csv -Path $output -NoTypeInformation
}

Since all of our components now support pipelining, you could do it all in one go:

1..4 |ForEach-Object {
    $finalpath = Join-Path $path "D$_"
    Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and @('.docx','.doc') -contains $_.Extension }
} |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround |Export-Csv -Path $output -NoTypeInformation