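I'm trying to search a large number of MS Word documents for keywords and write the results to a file. I have a working script, but I didn't anticipate the scale, and what I have is nowhere near efficient enough: it would take days to get through everything.
The script as it stands takes the keywords from CompareData.txt, runs each one against all the files in a specific folder, and appends the result to a file.
So when it's done, I know how many files contain each specific keyword.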
[cmdletBinding()]
Param(
$Path = "C:\willscratch\"
) #end param
$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($Findtext in $FindTexts)
{
$matchCase = $false
$matchWholeWord = $true
$matchWildCards = $false
$matchSoundsLike = $false
$matchAllWordForms = $false
$forward = $true
$wrap = 1
$application = New-Object -comobject word.application
$application.visible = $False
$docs = Get-childitem -path $Path -Recurse -Include *.docx
$i = 1
$totaldocs = 0
Foreach ($doc in $docs)
{
Write-Progress -Activity "Processing files" -status "Processing $($doc.FullName)" -PercentComplete ($i /$docs.Count * 100)
$document = $application.documents.open($doc.FullName)
$range = $document.content
$null = $range.movestart()
$wordFound = $range.find.execute($findText,$matchCase,
$matchWholeWord,$matchWildCards,$matchSoundsLike,
$matchAllWordForms,$forward,$wrap)
if($wordFound)
{
$doc.fullname
$document.Words.count
$totaldocs ++
} #end if $wordFound
$document.close()
$i++
} #end foreach $doc
$application.quit()
"There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt
#clean up stuff
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
Remove-Variable -Name application
[gc]::collect()
[gc]::WaitForPendingFinalizers()
}
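What I'd like to do is find a way to search each file once for everything in CompareData.txt, instead of iterating over the files once per keyword. If I were dealing with a small data set, the approach I have would get the job done, but it turns out that both CompareData.txt and the directory of source Word files will be very large.
Any ideas on how to optimize this?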
Answer 0 (score: 2)
Right now you're doing this (pseudocode):
foreach $Keyword {
create Word Application
foreach $File {
load Word Document from $File
find $Keyword
}
}
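That means that if you have 100 keywords and 10 documents, you open and close 100 instances of Word and load a thousand Word documents (100 keywords × 10 documents) before you're done.
Do this instead: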
create Word Application
foreach $File {
load Word Document from $File
foreach $Keyword {
find $Keyword
}
}
This way you launch only one instance of Word and load each document only once.
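Applied to the script in the question, the restructured loop might look roughly like the sketch below. It is not a drop-in replacement: the function name Find-KeywordsInWordDocs and its parameters are made up for illustration, it assumes Word is installed, and it reuses the same Find.Execute arguments as the original script.

```powershell
# Sketch only. Find-KeywordsInWordDocs is a hypothetical name, not an existing cmdlet.
# Assumes Word is installed and that the keywords are plain words or phrases.
function Find-KeywordsInWordDocs {
    param(
        [string]$Path,
        [string[]]$Keywords
    )

    # keyword -> number of files that contain it
    $counts = @{}
    foreach ($keyword in $Keywords) { $counts[$keyword] = 0 }

    # Start Word once, outside both loops
    $application = New-Object -ComObject Word.Application
    $application.Visible = $false
    try {
        $docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
        foreach ($doc in $docs) {
            # Load each document once
            $document = $application.Documents.Open($doc.FullName)
            try {
                foreach ($keyword in $Keywords) {
                    # Take a fresh range per keyword: a successful Find
                    # shrinks the range to the match it found
                    $range = $document.Content
                    # Same arguments as the question: no case match, whole word,
                    # no wildcards, no sounds-like, no word forms, forward, wrap = 1
                    $found = $range.Find.Execute($keyword, $false, $true, $false,
                        $false, $false, $true, 1)
                    if ($found) { $counts[$keyword]++ }
                }
            }
            finally {
                $document.Close()
            }
        }
    }
    finally {
        # Quit Word and release the COM object even if something failed
        $application.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($application)
    }
    return $counts
}
```

Calling it with -Path C:\willscratch\ and -Keywords (Get-Content C:\scratch\CompareData.txt) returns a hashtable of keyword to file count, which can then be written to output.txt in whatever format you need.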
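As noted in the comments, you could also optimize the whole process by using the OpenXML SDK instead of launching Word:
(assuming you've installed the OpenXML SDK in its default location)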
# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'
# Grab the keywords and file names
$Keywords = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx
# hashtable to store results per document
$KeywordMatches = @{}
# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]
foreach($Docx in $Documents)
{
# create array to hold matched keywords
$KeywordMatches[$Docx.FullName] = @()
# open document, wrap content stream in streamreader
$Document = $WordDoc::Open($Docx.FullName, $false)
$DocumentStream = $Document.MainDocumentPart.GetStream()
$DocumentReader = New-Object System.IO.StreamReader $DocumentStream
# read entire document
$DocumentContent = $DocumentReader.ReadToEnd()
# test for each keyword
foreach($Keyword in $Keywords)
{
$Pattern = [regex]::Escape($KeyWord)
$WordFound = $DocumentContent -match $Pattern
if($WordFound)
{
$KeywordMatches[$Docx.FullName] += $Keyword
}
}
$DocumentReader.Dispose()
$Document.Dispose()
}
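One difference from the Word-based script is worth noting: the loop above does a plain substring match against the raw XML of the document part, while the question used $matchWholeWord = $true. If whole-word matching matters, a small helper along these lines could stand in for the -match test. Test-WholeWord is a hypothetical name, and the sketch assumes the keywords begin and end with letters or digits, since \b only anchors at word characters.

```powershell
# Sketch only. Test-WholeWord is a hypothetical helper, not part of any library.
function Test-WholeWord {
    param(
        [string]$Text,
        [string]$Keyword
    )
    # Escape the keyword so it is matched literally, then require
    # a word boundary on both sides (-match is case-insensitive)
    $pattern = '\b' + [regex]::Escape($Keyword) + '\b'
    return $Text -match $pattern
}

# Example with plain strings
Test-WholeWord -Text 'the category list' -Keyword 'cat'   # substring only, so no match
Test-WholeWord -Text 'A Cat sat down'    -Keyword 'cat'   # whole word, so it matches
```

Running this on text rather than markup should also avoid accidental hits on XML element names. One possible source for that text is $Document.MainDocumentPart.Document.Body.InnerText instead of the raw stream, though InnerText joins paragraphs without separators, so words at paragraph boundaries may run together.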
Now you can show the number of matched keywords per document:
$KeywordMatches.GetEnumerator() | Select-Object @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
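The question asked for the opposite view, the number of files per keyword. That can be derived from the same hashtable by inverting it. The sketch below uses a small hard-coded sample shaped like $KeywordMatches so it runs on its own; the file paths and keywords are made up, and $FilesPerKeyword is a name introduced here for illustration.

```powershell
# Hypothetical sample shaped like $KeywordMatches:
# file path -> array of keywords found in that file
$KeywordMatches = @{
    'C:\docs\a.docx' = @('alpha', 'beta')
    'C:\docs\b.docx' = @('alpha')
    'C:\docs\c.docx' = @()
}

# Invert it: keyword -> number of files containing it
$FilesPerKeyword = @{}
foreach ($entry in $KeywordMatches.GetEnumerator()) {
    foreach ($keyword in $entry.Value) {
        # [int]$null is 0, so a keyword seen for the first time starts at 1
        $FilesPerKeyword[$keyword] = 1 + [int]$FilesPerKeyword[$keyword]
    }
}

# Same wording as the output line in the original script
$FilesPerKeyword.GetEnumerator() | Sort-Object Name | ForEach-Object {
    "There are $($_.Value) total files with $($_.Key)"
}
```

Keywords that matched no file do not appear in $FilesPerKeyword at all; if you need them listed with a count of 0, seed the hashtable from $Keywords first.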