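I have an input XML file (from another server) that contains a <Notes> node holding all of the notes users have entered. Since it is free-form text, it can contain all sorts of odd characters. What I want to do is remove every character except those on an allowed list.
The source XML looks like this: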
<?xml version="1.0" encoding="UTF-8"?>
<OrganisationUnits>
<OrganisationUnitsRow num="1">
<OrganisationId>PROV1</OrganisationId>
<OrganisationName>PROVIDER 1</OrganisationName>
<Addresses>
</Addresses>
<ContactDetails>
<ContactDetailsRow num="1">
<Notes>0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</Notes>
</ContactDetailsRow>
</ContactDetails>
</OrganisationUnitsRow>
<OrganisationUnitsRow num="2">
<OrganisationId>PROV2</OrganisationId>
<OrganisationName>PROVIDER 2</OrganisationName>
<Addresses>
<AddressesRow num="1">
<AddressId>G72516</AddressId>
</AddressesRow>
</Addresses>
<ContactDetails>
<ContactDetailsRow num="1">
<Notes>0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</Notes>
</ContactDetailsRow>
</ContactDetails>
</OrganisationUnitsRow>
</OrganisationUnits>
My code finds the correct part of the XML document, but it does not perform the replacement.
$xml = [xml](Get-Content $path\$xml_out)
$AcceptedChars = '[^a-z0-9 _,!?./;:<>(){}\[\]''\-\\"]'
$ReplaceChars = ""
foreach ($ContactDetailsRow in $xml.OrganisationUnits.OrganisationUnitsRow.ContactDetails) {
$n = $ContactDetailsRow.Item('Notes')
Write-Host "found the notes node `$n"
#check if the provider has got a notes node
if ($n) {
#do the search/replace
if ($n -NotMatch $AcceptedChars) {
$n -replace $AcceptedChars, $ReplaceChars
}
}
}
$xml.Save("$path\$xml_out")
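I am trying to use a regex replace to do the job, but it just isn't happening for me.
I changed the code based on Ansgar's comments and reduced $AcceptedChars to a single character range to show that it still does not work: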
$path = "C:\TEST"
$xml = [xml](Get-Content $path\source.xml)
$xmlout = "$path\new.xml"
$AcceptedChars = "a-z"
$InvalidChars = '[^{0}]' -f [regex]::Escape($AcceptedChars)
$ReplaceChar = '_'
foreach ($ContactDetailsRow in $xml.OrganisationUnits.OrganisationUnitsRow.ContactDetails) {
$n = $ContactDetailsRow.Item('Notes')
if ($n.'#text' -match $InvalidChars) {
$n.'#text' = $n.'#text' -replace $InvalidChars, $ReplaceChar
} else {
#There is no match on the accepted characters
echo "NO MATCH - we have not made a match on invalid characters"
}
}
$xml.Save($xmlout)
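Answer 0 (score: 1)
Your logic is broken. You named the variable $AcceptedChars and use it as if it contained the accepted characters (-notmatch $AcceptedChars), but it is actually defined as the characters that are not accepted ([^...]). A caret at the beginning of a character class inverts the class (it matches anything except the listed characters). In addition, you need to check the node's #text property rather than the node object itself, and assign the result of -replace back to that property, because -replace returns a new string instead of modifying its input: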
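# Accepted characters, written as the inside of a regex character class.
# ']' and '\' are escaped by hand ([regex]::Escape() leaves ']' and '-' alone),
# and '-' comes last so it is not read as a range.
$AcceptedChars = 'a-z0-9 _,!?./;:<>(){}\[\]''"\\-'
$InvalidChars = '[^{0}]' -f $AcceptedChars
$ReplaceChar = '_' # it's usually safer to replace invalid characters with
# a safe character rather than flat out remove them
...
if ($n.'#text' -match $InvalidChars) {
$n.'#text' = $n.'#text' -replace $InvalidChars, $ReplaceChar
}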
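To try the check-and-replace in isolation, here is a minimal sketch that runs against an inline XML string rather than the file from the question. The sample note text, the use of SelectNodes('//Notes') and the InnerText property (instead of '#text') are assumptions made for illustration. The character class is written out by hand because [regex]::Escape() does not escape ']' or '-', so ']' and '\' are escaped manually and '-' is placed last.

```powershell
# Inline sample instead of the real file; the note text is invented for illustration.
[xml]$xml = @'
<OrganisationUnits>
  <OrganisationUnitsRow num="1">
    <ContactDetails>
      <ContactDetailsRow num="1">
        <Notes>Call back after 5pm #urgent ~ 50% done</Notes>
      </ContactDetailsRow>
    </ContactDetails>
  </OrganisationUnitsRow>
</OrganisationUnits>
'@

# Negated class: matches any single character that is NOT on the allowed list.
# PowerShell's -match and -replace are case-insensitive, so a-z also covers A-Z.
$InvalidChars = '[^a-z0-9 _,!?./;:<>(){}\[\]''"\\-]'
$ReplaceChar  = '_'

foreach ($n in $xml.SelectNodes('//Notes')) {
    if ($n.InnerText -match $InvalidChars) {
        # -replace returns a new string, so the result must be assigned back.
        $n.InnerText = $n.InnerText -replace $InvalidChars, $ReplaceChar
    }
}

$result = $xml.SelectSingleNode('//Notes').InnerText
$result   # Call back after 5pm _urgent _ 50_ done
```

Selecting the Notes elements directly also sidesteps rows that have no Notes node at all, so no separate null check is needed in this sketch.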
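That being said, the XML sample in your question does not even parse as XML, because it contains an unencoded ampersand. Your code would therefore already fail on its first line: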
$xml = [xml](Get-Content $path\$xml_out)
with an error like this:
Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "An error occurred while parsing EntityName. Line 15, position 31."
So you will probably need to do something like this first:
$xmlfile = Join-Path $path $xml_out
(Get-Content $xmlfile) -replace '&', '&amp;' | Set-Content $xmlfile
Or like this:
$xmlfile = Join-Path $path $xml_out
[xml]$xml = (Get-Content $xmlfile) -replace '&', '&amp;'
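One caveat: a blanket replacement of every ampersand also rewrites ampersands that already belong to an entity such as &amp;amp; or &amp;#34;, which would double-encode them. The sketch below uses a negative lookahead so that only bare ampersands are escaped. The sample string and the variable names are invented for illustration, and the pattern assumes entities have the form &amp;name;, &amp;#123; or &amp;#x1F;.

```powershell
# Invented sample: two bare ampersands plus entities that are already encoded.
$raw = '<Notes>Smith & Sons &amp; Co &#34;quoted&#34; R&D</Notes>'

# Match '&' only when it is NOT followed by an entity body and a semicolon.
$bareAmpersand = '&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)'

$fixed = $raw -replace $bareAmpersand, '&amp;'

# The repaired string can now be cast to an XML document.
[xml]$doc = $fixed
$doc.Notes
```

Named entities other than the five predefined XML ones would still need a DTD to parse, so this only addresses stray ampersands.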