Optimizing keyword search across Word documents

Asked: 2015-10-22 22:32:05

Tags: powershell ms-word

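I'm trying to search for keywords across a large number of MS Word documents and write the results to a file. I have a working script, but I wasn't aware of the scale, and what I have isn't nearly efficient enough; it would take days to get through everything.

The script as it stands takes keywords from CompareData.txt, runs each one against all the files in a specific folder, and then appends the result to a file.

So when I'm done, I'll know how many files contain each specific keyword.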

[CmdletBinding()]
Param(
    $Path = "C:\willscratch\"
) #end param

$findTexts = (Get-Content c:\scratch\CompareData.txt)
Foreach ($findText in $findTexts)
{
    $matchCase = $false
    $matchWholeWord = $true
    $matchWildCards = $false
    $matchSoundsLike = $false
    $matchAllWordForms = $false
    $forward = $true
    $wrap = 1
    $application = New-Object -ComObject word.application
    $application.visible = $False
    $docs = Get-ChildItem -Path $Path -Recurse -Include *.docx
    $i = 1
    $totaldocs = 0
    Foreach ($doc in $docs)
    {
        Write-Progress -Activity "Processing files" -Status "Processing $($doc.FullName)" -PercentComplete ($i / $docs.Count * 100)
        $document = $application.documents.open($doc.FullName)
        $range = $document.content
        $null = $range.movestart()
        $wordFound = $range.find.execute($findText, $matchCase,
            $matchWholeWord, $matchWildCards, $matchSoundsLike,
            $matchAllWordForms, $forward, $wrap)
        if ($wordFound)
        {
            $doc.fullname
            $document.Words.count
            $totaldocs++
        } #end if $wordFound
        $document.close()
        $i++
    } #end foreach $doc
    $application.quit()
    "There are $totaldocs total files with $findText" | Out-File -Append C:\scratch\output.txt

    #clean up stuff
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($range) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($document) | Out-Null
    [System.Runtime.InteropServices.Marshal]::ReleaseComObject($application) | Out-Null
    Remove-Variable -Name application
    [gc]::collect()
    [gc]::WaitForPendingFinalizers()
}

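What I'd like to do is find a way to search each file once for everything in CompareData.txt, rather than iterating over the files many times. If I were dealing with a small set of data, the approach I have would get the job done, but I've found out that both CompareData.txt and the source directory of Word files will be very large.

Any ideas on how to optimize this?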

1 answer:

Answer 0 (score: 2)

Right now you're doing this (pseudocode):

foreach $Keyword {
    create Word Application
    foreach $File {
        load Word Document from $File
        find $Keyword
    }
}

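This means that if you have 100 keywords and 10 documents, you open and close 100 instances of Word and load a thousand Word documents before you're done.

Do this instead: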

create Word Application
foreach $File {
    load Word Document from $File
    foreach $Keyword {
        find $Keyword
    }
}

This way, you only launch one instance of Word and only load each document once.
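Applied to the script from the question, the reordered loops might look roughly like the sketch below. It reuses the paths and the `Find.Execute` arguments from the question; the `$totals` hashtable and the `try`/`finally` wrapper are additions made for illustration, and the sketch assumes Microsoft Word is installed.

```powershell
# Sketch only: requires Microsoft Word; paths are the ones used in the question.
$Path      = 'C:\willscratch\'
$findTexts = Get-Content 'C:\scratch\CompareData.txt'
$docs      = Get-ChildItem -Path $Path -Recurse -Include *.docx

# keyword -> number of documents containing it ($totals is an illustrative name)
$totals = @{}
foreach ($findText in $findTexts) { $totals[$findText] = 0 }

# Start Word once for the whole run
$application = New-Object -ComObject Word.Application
$application.Visible = $false
try {
    foreach ($doc in $docs) {
        # Load each document once
        $document = $application.Documents.Open($doc.FullName)
        foreach ($findText in $findTexts) {
            # Take a fresh range per keyword, because a successful Find
            # narrows the range to the match
            $range = $document.Content
            # Argument order as in the question: text, matchCase, matchWholeWord,
            # matchWildCards, matchSoundsLike, matchAllWordForms, forward, wrap
            if ($range.Find.Execute($findText, $false, $true, $false, $false, $false, $true, 1)) {
                $totals[$findText]++
            }
        }
        $document.Close()
    }
}
finally {
    # Quit Word even if a document fails to open
    $application.Quit()
}

# Same output wording as the original script, one line per keyword
foreach ($findText in $findTexts) {
    "There are $($totals[$findText]) total files with $findText" |
        Out-File -Append C:\scratch\output.txt
}
```

With 100 keywords and 10 documents, this starts Word once and opens 10 documents instead of 1,000.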

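As noted in the comments, you could optimize the whole process by using the OpenXML SDK rather than launching Word:

(assuming you have installed the OpenXML SDK in its default location)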

# Import the OpenXML library
Add-Type -Path 'C:\Program Files (x86)\Open XML SDK\V2.5\lib\DocumentFormat.OpenXml.dll'

# Grab the keywords and file names    
$Keywords  = Get-Content C:\scratch\CompareData.txt
$Documents = Get-childitem -path $Path -Recurse -Include *.docx  

# hashtable to store results per document
$KeywordMatches = @{}

# store OpenXML word document type in variable as a shorthand
$WordDoc = [DocumentFormat.OpenXml.Packaging.WordprocessingDocument] -as [type]

foreach($Docx in $Documents)
{
    # create array to hold matched keywords
    $KeywordMatches[$Docx.FullName] = @()

    # open document, wrap content stream in streamreader 
    $Document       = $WordDoc::Open($Docx.FullName, $false)
    $DocumentStream = $Document.MainDocumentPart.GetStream()
    $DocumentReader = New-Object System.IO.StreamReader $DocumentStream

    # read entire document
    $DocumentContent = $DocumentReader.ReadToEnd()

    # test for each keyword
    foreach($Keyword in $Keywords)
    {
        $Pattern   = [regex]::Escape($KeyWord)
        $WordFound = $DocumentContent -match $Pattern
        if($WordFound)
        {
            $KeywordMatches[$Docx.FullName] += $Keyword
        }
    }

    $DocumentReader.Dispose()
    $Document.Dispose()
}
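One behavioral difference from the script in the question: it set `$matchWholeWord = $true`, whereas an escaped pattern used with `-match` also matches inside longer words. If whole-word matching matters, one possible approach is to wrap the escaped keyword in `\b` word-boundary anchors. The sample text and keywords below are made up for illustration.

```powershell
# Hypothetical stand-in for the text read from a document
$DocumentContent = 'The category list is long.'
$Keywords        = 'cat', 'list', 'category'

$Found = foreach ($Keyword in $Keywords) {
    # \b anchors the escaped keyword at word boundaries;
    # -match is case-insensitive, like $matchCase = $false in the question
    $Pattern = '\b' + [regex]::Escape($Keyword) + '\b'
    if ($DocumentContent -match $Pattern) { $Keyword }
}

# 'cat' is not reported, because it only occurs inside 'category'
$Found
```

Note that `\b` depends on word characters, so keywords that begin or end with punctuation (for example `C++`) would need a different boundary rule.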

Now you can show how many keywords matched in each document:

$KeywordMatches.GetEnumerator() | Select-Object @{n="File";E={$_.Key}},@{n="Count";E={$_.Value.Count}}
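The question asked for the number of files per keyword, which is the inverse of the per-document table. A minimal sketch of that inversion follows; the sample `$Keywords` and `$KeywordMatches` values are invented so the snippet runs on its own, and `$FilesPerKeyword` and `$Summary` are illustrative names.

```powershell
# Hypothetical sample data standing in for the real keyword list and results
$Keywords = 'alpha', 'beta', 'gamma'
$KeywordMatches = @{
    'C:\docs\a.docx' = @('alpha', 'beta')
    'C:\docs\b.docx' = @('alpha')
    'C:\docs\c.docx' = @()
}

# Invert the table: keyword -> number of files that contain it
$FilesPerKeyword = @{}
foreach ($Entry in $KeywordMatches.GetEnumerator()) {
    foreach ($Keyword in $Entry.Value) {
        # A missing key yields $null, which [int] converts to 0
        $FilesPerKeyword[$Keyword] = 1 + [int]$FilesPerKeyword[$Keyword]
    }
}

# One line per keyword, in the wording the original script used;
# keywords with no matches are reported as 0
$Summary = foreach ($Keyword in $Keywords) {
    "There are $([int]$FilesPerKeyword[$Keyword]) total files with $Keyword"
}
$Summary
```

Piping `$Summary` to `Out-File -Append C:\scratch\output.txt` would reproduce the output file from the question.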