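I have drafted a PowerShell script that searches for a string across a large number of Word files. The script works, but I have about 1 GB of data to search and a run takes about 15 minutes.
Can anyone suggest changes that would make it run faster?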
Set-StrictMode -Version latest
$path = "c:\Tester1"
$output = "c:\Scripts\ResultMatch1.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "Roaming"
$charactersAround = 30
$results = @()
Function getStringMatch
{
    For ($i=1; $i -le 4; $i++) {
        $j="D"+$i
        $finalpath=$path+"\"+$j
        $files = Get-Childitem $finalpath -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
        # Loop through all *.doc files in the $path directory
        Foreach ($file In $files)
        {
            $document = $application.documents.open($file.FullName,$false,$true)
            $range = $document.content
            If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
                $properties = @{
                    File = $file.FullName
                    Match = $findtext
                    TextAround = $Matches[0]
                }
                $results += New-Object -TypeName PsCustomObject -Property $properties
                $document.close()
            }
        }
    }
    If($results){
        $results | Export-Csv $output -NoTypeInformation
    }
    $application.quit()
}
getStringMatch
import-csv $output
Answer 0 (score: 0)
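As mentioned in the comments, you may want to consider the Open XML SDK library (the latest version of the SDK is also available on GitHub), since it carries less overhead than starting an instance of Word.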
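To illustrate why no Word instance is needed: a .docx file is a ZIP package whose main text lives in the part `word/document.xml`. The sketch below is not the SDK's API; it swaps in the .NET `System.IO.Compression` classes purely to show the idea. The package it builds is a hypothetical stand-in containing only that one part, so it is not a complete Word document.

```powershell
Add-Type -AssemblyName System.IO.Compression
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Build a tiny stand-in package in the temp folder (hypothetical content)
$Package = Join-Path ([System.IO.Path]::GetTempPath()) ("sketch-{0}.docx" -f [guid]::NewGuid())
$Zip = [System.IO.Compression.ZipFile]::Open($Package, [System.IO.Compression.ZipArchiveMode]::Create)
try {
    $Writer = New-Object System.IO.StreamWriter ($Zip.CreateEntry('word/document.xml').Open())
    $Writer.Write('<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body><w:p><w:r><w:t>Roaming profile</w:t></w:r></w:p></w:body></w:document>')
    $Writer.Dispose()
}
finally {
    $Zip.Dispose()
}

# Read the main part back and strip the markup, without starting Word
$Zip = [System.IO.Compression.ZipFile]::OpenRead($Package)
try {
    $Reader = New-Object System.IO.StreamReader ($Zip.GetEntry('word/document.xml').Open())
    [xml]$Xml = $Reader.ReadToEnd()
    $Reader.Dispose()
    # InnerText concatenates all text nodes below the root element
    $Text = $Xml.DocumentElement.InnerText
}
finally {
    $Zip.Dispose()
    Remove-Item $Package
}

$Text
```

The SDK wraps this package handling in typed classes, which is why it is usually the more convenient choice for real documents.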
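Below I have converted your current function into a more generic one that uses the SDK and does not depend on the caller's or parent scope. Keep in mind that the SDK only reads Open XML packages such as .docx; legacy binary .doc files cannot be opened with it.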
function Get-WordStringMatch
{
    param(
        [Parameter(Mandatory,ValueFromPipeline)]
        [System.IO.FileInfo[]]$Files,
        [string]$FindText,
        [int]$CharactersAround
    )
    begin {
        # import the OpenXML library (adjust the path to wherever the SDK is installed)
        Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
        # make a "shorthand" reference to the word document type
        $WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
        # construct the regex pattern; escape the search text so it is matched literally
        $Pattern = ".{$CharactersAround}$([regex]::Escape($FindText)).{$CharactersAround}"
    }
    process {
        # loop through all the *.docx files
        foreach ($File in $Files)
        {
            $Document = $null
            try {
                # open the document read-only
                $Document = $WordDoc::Open($File.FullName, $false)
                # InnerText concatenates the text of every run, so the regex
                # sees the document text rather than the raw XML markup
                $Text = $Document.MainDocumentPart.Document.Body.InnerText
                if($Text -match $Pattern)
                {
                    # got a match? output our custom object
                    New-Object psobject -Property @{
                        File = $File.FullName
                        Match = $FindText
                        TextAround = $Matches[0]
                    }
                }
            }
            catch {
                # e.g. legacy binary .doc files, which the SDK cannot open
                Write-Warning "Skipping '$($File.FullName)': $($_.Exception.Message)"
            }
            finally {
                # clean up after every file, not just the last one
                if($Document){ $Document.Dispose() }
            }
        }
    }
}
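The pattern built in the `begin` block can be tried on its own. The following is a minimal sketch with made-up sample strings (everything except the pattern shape is hypothetical). It also shows a lenient variant using `{0,N}`, which is an assumption on my part rather than something the function above uses: the strict form fails whenever fewer than N characters surround the search text.

```powershell
# Hypothetical sample values, for illustration only
$FindText         = 'C:\Users (Roaming)'
$CharactersAround = 5
$Sample           = 'profile is stored in C:\Users (Roaming) on every machine'
$Short            = 'C:\Users (Roaming) only'

# Escape the search text so that \ ( ) are matched literally
$Escaped = [regex]::Escape($FindText)

# Strict form: exactly N characters are required on each side
$Strict  = ".{$CharactersAround}$($Escaped).{$CharactersAround}"
# Lenient form (assumption): up to N characters on each side
$Lenient = ".{0,$CharactersAround}$($Escaped).{0,$CharactersAround}"

$StrictHit   = if ($Sample -match $Strict) { $Matches[0] }
$StrictShort = $Short -match $Strict        # nothing precedes the search text here
$LenientHit  = if ($Short -match $Lenient) { $Matches[0] }

$StrictHit      # d in C:\Users (Roaming) on e
$StrictShort    # False
$LenientHit     # C:\Users (Roaming) only
```

Without `[regex]::Escape`, the backslash and parentheses in the search text would be interpreted as regex syntax, so escaping matters as soon as the search string is not a plain word.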
Now that you have a function that supports pipeline input, all you need to do is gather your files and pipe them to it!
# variables
$path = "c:\Tester1"
$output = "c:\Scripts\ResultMatch1.csv"
$findtext = "Roaming"
$charactersAround = 30
# gather the files
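$files = 1..4 | ForEach-Object {
    # $_ is the current number from the pipeline (1..4)
    $finalpath = Join-Path $path "D$_"
    # Extension includes the leading dot; only .docx can be read by the Open XML SDK
    Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and (@('.docx') -contains $_.Extension) }
}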
# run them through our new function
$results = $files |Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround
# got any results? export it all to CSV
if($results){
$results |Export-Csv -Path $output -NoTypeInformation
}
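One detail worth checking in the file-gathering step is the extension filter. The `Extension` property of a file holds the leading dot and no wildcard, and `-contains` tests for whole-value equality (case-insensitively), so the list has to contain values like `.docx`. A small sketch with hypothetical file names, using `[System.IO.Path]::GetExtension` as a stand-in for the `Extension` property so it runs without any files on disk:

```powershell
# Hypothetical file names, for illustration only
$Names      = 'report.docx', 'notes.txt', 'legacy.DOC', 'backup.docx.bak'
$Extensions = @('.docx', '.doc')

# Keep only the names whose extension is in the list
$Kept = @($Names | Where-Object { $Extensions -contains [System.IO.Path]::GetExtension($_) })

# Wildcard entries never equal a real extension, so this test is always false
$WildcardTest = @('*.docx', '*.doc') -contains '.docx'

$Kept            # report.docx, legacy.DOC
$WildcardTest    # False
```

Wildcards belong with `-like` or with the `-Include`/`-Filter` parameters of `Get-ChildItem`, not with `-contains`.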
Since all of our components now support pipelining, you can do everything in one go:
1..4 | ForEach-Object {
    $finalpath = Join-Path $path "D$_"
    Get-ChildItem $finalpath -Recurse | Where-Object { !($_.PsIsContainer) -and (@('.docx') -contains $_.Extension) }
} | Get-WordStringMatch -FindText $findtext -CharactersAround $charactersAround | Export-Csv -Path $output -NoTypeInformation
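To see whether this actually improves on the 15 minutes from the question, the run can be timed with `Measure-Command`. The sketch below uses a hypothetical stand-in workload so that it runs anywhere; in practice you would put the pipeline above inside the script block.

```powershell
# Hypothetical stand-in workload; replace the body with the real search pipeline
$Work = {
    1..1000 | ForEach-Object { "line $_" -match 'Roaming' } | Out-Null
}

# Measure-Command runs the script block once and returns a TimeSpan
$Elapsed = Measure-Command -Expression $Work

'{0:N0} ms' -f $Elapsed.TotalMilliseconds
```

Timing the Word-based script and the SDK-based pipeline against the same folder gives a like-for-like comparison; a single run can be skewed by disk caching, so repeating each measurement is advisable.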