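I am searching a very large number of Word documents (5000) for a large number of strings (3000). I know how to do this in a PowerShell script, but it takes a very long time. Fortunately, most of these strings share the same first 3 or 4 characters, so if I use a wildcard search in the find.execute statement I can narrow the list down to about 300 strings. If strings.txt contains (cod)*, the search finds results such as "code", "coding", etc. in the Word documents, and I need to write those results to a text file. So far I have not had much luck.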
$filePath = "C:\files\"
$textPath = "C:\strings.txt"
$outputPath = "C:\output.txt"
$findTexts = (Get-Content $textPath)
$docs = Get-childitem -path $filePath -Recurse -Include *.docx
$application = New-Object -comobject word.application
Foreach ($doc in $docs)
{
    $document = $application.documents.open("$doc", $false, $true)
    $application.visible = $False
    $matchCase = $false
    $matchWholeWord = $false
    $matchWildCards = $true
    $matchSoundsLike = $false
    $matchAllWordForms = $false
    $forward = $true
    $wrap = 1
    $range = $document.content
    $null = $range.movestart()
    Foreach ($findtext in $findTexts)
    {
        $wordFound = $range.find.execute($findText,$matchCase,$matchWholeWord,$matchWildCards,$matchSoundsLike, $matchAllWordForms,$forward,$wrap)
        if ($wordFound)
        {
            $docName = $doc.Name
            #Output search results and file name to a tab-delimited file
            "$findText`t$docName" | Out-File -append $outputPath
        } #end if $wordFound
    } #end foreach $findText
    $document.close()
} #end foreach $doc
$application.quit()
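If a Word document contains "coding", this script writes the (cod)* wildcard and the file name to output.txt, because $findText = (cod)*. Is there a way to get the matched text, "coding", written to the file instead?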
Answer 0 (score: 1)
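Why not run a PowerShell regular expression over the full text of the document instead of using Word's wildcard search? Note that this treats $findText as a regular expression, so strings.txt should hold the plain prefix (for example cod) rather than the Word wildcard (cod)*; as a regex, (cod)* also matches zero repetitions, so the pattern below would match any word. When -match succeeds, $matches[0] holds the text that was actually matched. Something like this: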
if ($document.Content.Text -match "\b$($findText)\w+\b")
{
    $docName = $doc.Name
    "$($matches[0])`t$docName" | Out-File -append $outputPath
}
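
One limitation of the snippet above is that -match stops at the first hit, so a document containing both "code" and "coding" would report only one of them. The following is a minimal sketch of one way to collect every distinct match in a single pass. It is not part of the original answer: the function name Get-PrefixMatch and the sample text are hypothetical, and it assumes strings.txt holds plain prefixes such as cod. It also uses \w* rather than \w+, so a word equal to the prefix itself is reported too.

```powershell
# Hypothetical helper (not from the original answer): returns every distinct
# word in $Text that starts with one of the literal prefixes in $Prefixes.
function Get-PrefixMatch {
    param(
        [Parameter(Mandatory)] [string]   $Text,
        [Parameter(Mandatory)] [string[]] $Prefixes
    )

    # Escape each prefix so characters such as ( ) * are treated literally,
    # then join them into one alternation, e.g. \b(?:cod|doc)\w*\b
    $escaped = $Prefixes | ForEach-Object { [regex]::Escape($_) }
    $pattern = '\b(?:' + ($escaped -join '|') + ')\w*\b'

    # IgnoreCase mirrors $matchCase = $false in the question's script.
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase

    # Emit each matched word once.
    [regex]::Matches($Text, $pattern, $options) |
        ForEach-Object { $_.Value } |
        Sort-Object -Unique
}

# Stand-in for $document.Content.Text so the sketch runs without Word.
$sampleText = 'The code was fine, but coding style and codes differ. Decode this.'
$found = Get-PrefixMatch -Text $sampleText -Prefixes 'cod'
$found
```

In the question's script, the inner foreach over $findTexts could then be replaced by one call per document, passing $document.Content.Text as -Text and $findTexts as -Prefixes, and writing each returned word together with $doc.Name to $outputPath. Combining all prefixes into one pattern means each document's text is scanned once instead of once per search string, which may help with the run time, although opening 5000 documents through Word is likely to remain the slow part.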