Question

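I have this PowerShell script whose main purpose is to search through the HTML files in a folder, find specific HTML markup, and replace it with what I tell it to.

I have been able to do 3/4 of my find-and-replaces perfectly. The one I am having trouble with involves a regular expression.

This is the markup I am trying to get my regex to find and replace: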

<a href="programsactivities_skating.html"><br />
                                           </a>

Here is the regex I have so far, along with the function I am using it in:

automate -school "C:\Users\$env:username\Desktop\schools\$question" -query '(?mis)(?!exclude1|exclude2|exclude3)(<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>)' -replace ''

This is the automate function:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        $text = Get-Content $file
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}

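I have been trying to figure out a solution for about 2 days now and just can't seem to get it to work. I have determined that the problem is that I need to tell my regex to account for multiple lines, and that is what I am having trouble with.

Any help anyone can provide is greatly appreciated.

Thanks in advance.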

Answer 1

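Get-Content produces an array of strings, where each string contains a single line from the input file, so you won't be able to match text passages that span more than one line. If you want to be able to match multiple lines, you need to merge the array into a single string: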

$text = Get-Content $file | Out-String

or

[String]$text = Get-Content $file

or

$text = [IO.File]::ReadAllText($file)

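Note that the 1st and 2nd methods don't preserve the line breaks from the input file. Method 2 simply mangles all line breaks (the cast joins the lines with a single space by default), as Keith pointed out in the comments, and method 1 puts <CR><LF> at the end of each line when joining the array. The latter may be an issue when dealing with Linux/Unix or Mac files.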
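To make the difference concrete, here is a minimal sketch rather than a drop-in fix. It assumes a throwaway sample file (the temp file name and the `x.html` link are made up for illustration) and uses a trimmed version of the question's pattern without the exclude lookahead.

```powershell
# Write a small sample file with Unix (LF) line endings to a temp path.
# The file name and its contents are hypothetical, not from the original post.
$path = Join-Path ([IO.Path]::GetTempPath()) 'multiline-demo.html'
[IO.File]::WriteAllText($path, "<a href=`"x.html`"><br />`n   </a>`n<p>keep</p>`n")

# Trimmed version of the question's pattern (exclude lookahead omitted).
$query = '(?is)<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>'

# Get-Content yields one string per line; -replace runs on each element
# separately, so the empty anchor that spans two lines is never matched.
$perLine = (Get-Content $path) -replace $query, ''

# ReadAllText yields the whole file as one string with its line breaks
# untouched, so the same pattern can match across the line break.
$whole = [IO.File]::ReadAllText($path) -replace $query, ''

Remove-Item $path
```

Here `$perLine` still holds all three original lines, while `$whole` has the empty anchor removed. On PowerShell 3.0 or later, `Get-Content $path -Raw` should also return the file as a single string.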

Answer 2

I'm not sure what you're trying to do with those Exclude elements, but I find that a multi-line regex is usually easier to construct in a here-string:

$text = @'
<a href="programsactivities_skating.html"><br />
                                       </a>
'@

$regex = @'
(?mis)<a href="programsactivities_skating.html"><br />
\s+?</a>
'@

$text -match $regex

True
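One thing to keep in mind with the here-string approach is that the literal line break inside the pattern has to match the line break in the text exactly (CRLF versus LF). As a hedged variation, assuming you don't need the pattern to contain a literal line break, `\s*` can stand in for the line break and the indentation together. The sample strings below are hypothetical stand-ins for file content.

```powershell
# The same markup with Windows (CRLF) and Unix (LF) line endings.
$crlfText = "<a href=`"programsactivities_skating.html`"><br />`r`n      </a>"
$lfText   = "<a href=`"programsactivities_skating.html`"><br />`n      </a>"

# \s* matches the line break and the indentation, whatever the line ending;
# the dot in the file name is escaped so it only matches a literal dot.
$regex = '<a href="programsactivities_skating\.html"><br />\s*</a>'

$crlfMatched = $crlfText -match $regex
$lfMatched   = $lfText -match $regex
```

Both `$crlfMatched` and `$lfMatched` come out as `$true`, because `\s` matches carriage returns, line feeds, and spaces alike.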

Answer 3

Get-Content will return an array of strings; you want to concatenate those strings to create a single string:

function automate($school, $query, $replace) {
    $processFiles = Get-ChildItem -Exclude *.bak -Include "*.html", "*.HTML", "*.htm", "*.HTM" -Recurse -Path $school
    foreach ($file in  $processFiles) {
        # Build one string from all lines so the regex can match across line breaks.
        $text = ""
        Get-Content $file | % { $text += $_ + "`r`n" }
        $text = $text -replace $query, $replace
        $text | Out-File $file -Force -Encoding utf8
    }
}
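The line-joining step can also be written with the `-join` operator instead of a loop. The following is only a sketch, with a hypothetical array standing in for the output of Get-Content, showing that both forms produce the same string.

```powershell
# Hypothetical lines standing in for what Get-Content would return.
$lines = '<a href="x.html"><br />', '   </a>', '<p>keep</p>'

# Loop form: appends CRLF after every line, including the last one.
$text = ""
$lines | % { $text += $_ + "`r`n" }

# -join puts CRLF only between the lines, so the trailing CRLF is added
# explicitly to produce the same string as the loop.
$joined = ($lines -join "`r`n") + "`r`n"

# Either string can now be matched across line breaks.
$result = $joined -replace '(?is)<a[^>]*?>(\s|&nbsp;|<br\s?/?>)*</a>', ''
```

Both forms normalize the line endings to CRLF, so the caveat about Linux/Unix or Mac files applies here as well.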

Multiline regular expressions in PowerShell
