XML: replace invalid characters (allow list)

Time: 2016-11-23 11:45:00

Tags: xml list powershell character

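I have an input XML file (from another server) that contains a <Notes> node holding all user-entered notes. Because it is free-form text, it can contain all sorts of odd characters. What I'd like to do is remove every character except those on an allow list:

  • a-z (lowercase)
  • A-Z (uppercase)
  • other common characters, e.g. (space,_,!?./;:<>(){} / []"')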

The source XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<OrganisationUnits>
  <OrganisationUnitsRow num="1">
    <OrganisationId>PROV1</OrganisationId>
    <OrganisationName>PROVIDER 1</OrganisationName>
    <Addresses>
    </Addresses>
    <ContactDetails>
      <ContactDetailsRow num="1">
        <Notes>0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</Notes>
      </ContactDetailsRow>
    </ContactDetails>
  </OrganisationUnitsRow>
  <OrganisationUnitsRow num="2">
    <OrganisationId>PROV2</OrganisationId>
    <OrganisationName>PROVIDER 2</OrganisationName>
    <Addresses>
      <AddressesRow num="1">
        <AddressId>G72516</AddressId>
      </AddressesRow>
    </Addresses>
    <ContactDetails>
      <ContactDetailsRow num="1">
        <Notes>0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</Notes>
      </ContactDetailsRow>
    </ContactDetails>
  </OrganisationUnitsRow>
</OrganisationUnits>

My code finds the right part of the XML document, but it doesn't perform the replacement.

$xml = [xml](Get-Content $path\$xml_out)
$AcceptedChars = '[^a-z0-9 _,!?./;:<>(){}\[\]''\-\\"]'
$ReplaceChars = ""

foreach ($ContactDetailsRow in $xml.OrganisationUnits.OrganisationUnitsRow.ContactDetails) {
    $n = $ContactDetailsRow.Item('Notes')
    Write-Host "found the notes node `$n"
    #check if the provider has got a notes node
    if ($n) {
        #do the search/replace
        if ($n -NotMatch $AcceptedChars) {
            $n -replace $AcceptedChars, $ReplaceChars
        }
    }
}
$xml.Save("$path\$xml_out")

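I'm trying to use a regex replace to do the job, but it just isn't happening for me.

Changed the code following Ansgar's comment, and reduced AllowedChars to just digits to show that it isn't working: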

$path = "C:\TEST"
$xml = [xml](Get-Content $path\source.xml)
$xmlout = "$path\new.xml"

$AcceptedChars = "a-z"
$InvalidChars  = '[^{0}]' -f [regex]::Escape($AcceptedChars)
$ReplaceChar   = '_'

foreach ($ContactDetailsRow in $xml.OrganisationUnits.OrganisationUnitsRow.ContactDetails) {
    $n = $ContactDetailsRow.Item('Notes')

    if ($n.'#text' -match $InvalidChars) {
        $n.'#text' = $n.'#text' -replace $InvalidChars, $ReplaceChar
    } else {
        #There is no match on the accepted characters
        echo "NO MATCH - we have not made a match on invalid characters"
    }
}
$xml.Save($xmlout)

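1 answer:

Answer 0: (score: 1)

Your logic is broken. You named the variable $AcceptedChars and use it as if it contained the accepted characters (-notmatch $AcceptedChars), but it is actually defined as the characters that are NOT accepted ([^...]). A caret at the beginning of a character class inverts the class (it matches anything except the listed characters). Also, you need to check the #text property of the node, not the node object itself: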

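# [regex]::Escape() leaves ']' and '-' unescaped, so the allow list is written
# pre-escaped here instead of being passed through Escape()
$AcceptedChars = 'a-z0-9 _,!?./;:<>(){}\[\]''"\\-'
$InvalidChars  = '[^{0}]' -f $AcceptedChars
$ReplaceChar   = '_'   # it's usually safer to replace invalid characters with
                       # a safe character rather than flat out remove them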
...
if ($n.'#text' -match $InvalidChars) {
    $n.'#text' = $n.'#text' -replace $InvalidChars, $ReplaceChar
}
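
As a rough end-to-end sketch of the approach above (not the poster's final code): in the sample document from the question, Notes is a child of ContactDetailsRow rather than of ContactDetails, so calling Item('Notes') on ContactDetails would come back empty. The snippet below therefore selects the nodes with the XPath expression //Notes. The inline XML string, the sample note text, and the hand-written character class are illustrative assumptions.

```powershell
# Illustrative input; a real document would be loaded from a file instead.
[xml]$xml = @'
<OrganisationUnits>
  <OrganisationUnitsRow num="1">
    <ContactDetails>
      <ContactDetailsRow num="1">
        <Notes>Call back at 5pm ~ ask for Bob #2 *urgent*</Notes>
      </ContactDetailsRow>
    </ContactDetails>
  </OrganisationUnitsRow>
</OrganisationUnits>
'@

# Negated class: anything NOT on the allow list counts as invalid.
# -match and -replace are case-insensitive, so a-z also covers A-Z.
$InvalidChars = '[^a-z0-9 _,!?./;:<>(){}\[\]''"\\-]'
$ReplaceChar  = '_'

# //Notes finds every Notes element regardless of how deeply it is nested.
foreach ($n in $xml.SelectNodes('//Notes')) {
    if ($n.InnerText -match $InvalidChars) {
        $n.InnerText = $n.InnerText -replace $InvalidChars, $ReplaceChar
    }
}

# Show the cleaned note text.
$xml.SelectSingleNode('//Notes').InnerText
```

Using InnerText instead of '#text' is a design choice for this sketch: it also works when the element is empty, whereas '#text' only exists when the node already has a text child.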

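With that said, the XML sample from your question wouldn't even parse as XML, because it contains unencoded ampersands. Your code would therefore fail on the first line: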

$xml = [xml](Get-Content $path\$xml_out)

with an error like this:

Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "An error occurred while parsing EntityName. Line 15, position 31."

So you may need to do something like this first:

$xmlfile = Join-Path $path $xml_out
(Get-Content $xmlfile) -replace '&', '&amp;' | Set-Content $xmlfile

Or like this:

$xmlfile = Join-Path $path $xml_out
[xml]$xml = (Get-Content $xmlfile) -replace '&', '&amp;'
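
One caveat with a blanket replace of & is that entities which are already encoded (for example &lt; or an existing &amp;) would be encoded a second time. A possible variation, sketched here with a hypothetical sample string, uses a negative lookahead so that only ampersands that do not already start an entity reference are touched. It assumes only the five predefined XML entities and numeric references occur; any custom entity would be treated as a bare ampersand.

```powershell
# Match '&' only when it is NOT followed by a predefined or numeric entity and ';'.
$BareAmpersand = '&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-f]+);)'

# Hypothetical sample line mixing raw and already-encoded ampersands.
$line  = '<Notes>Smith & Sons &amp; Co &lt;ltd&gt; &#34;ok&#34;</Notes>'

# Only the bare ampersand after "Smith" is rewritten.
$fixed = $line -replace $BareAmpersand, '&amp;'
$fixed
```

The same pattern could be substituted for '&' in either of the two variants above.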