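How can I ignore phantom XML attributes that cause an infinite loop when I try to rename a node?

Asked: 2013-04-26 19:53:41

Tags: .net xml vb.net .net-4.0 xml-parsing

My task is to convert the results of a RESTful web service into an XML document with a new format.

Sample of the HTML/XHTML to convert: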

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <head>
        <title>OvidWS Result Set Resource</title>
    </head>
    <body>
        <table id="results">
            <tr>
                <td class="_index">
                  <a class="uri" href="REDACTED">1</a>
                </td>
                <td class="au">
                  <span>GILLESPIE JB</span>
                  <span>KUKES RE</span>
                </td>
                <td class="so">A.M.A. American Journal of Diseases of Children</td>
                <td class="ti">Acetylsalicylic acid poisoning with recovery.</td>
                <td class="ui">20267726</td>
                <td class="yr">1947</td>
            </tr>
            <tr>
                <td class="_index">
                  <a class="uri" href="REDACTED">2</a>
                </td>
                <td class="au">BASS MH</td>
                <td class="so">Journal of the Mount Sinai Hospital, New York</td>
                <td class="ti">Aspirin poisoning in infants.</td>
                <td class="ui">20265054</td>
                <td class="yr">1947</td>
            </tr>
        </table>
    </body>
</html>

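Ideally, I want to take whatever is listed in the class attribute and use it as the element name; if there is no "class" attribute, I just want to label the element as item.

This is the transformation I'm looking for: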

<results>
    <citation>
        <_index>
            <uri href="REDACTED">1</uri>
        </_index>
        <au>
            <item>GILLESPIE JB</item>
            <item>KUKES RE</item>
        </au>
        <so>A.M.A. American Journal of Diseases of Children</so>
        <ti>Acetylsalicylic acid poisoning with recovery.</ti>
        <ui>20267726</ui>
        <yr>1947</yr>
    </citation>
    <citation>
        <_index>
            <uri href="REDACTED">2</uri>
        </_index>
        <au>BASS MH</au>
        <so>Journal of the Mount Sinai Hospital, New York</so>
        <ti>Aspirin poisoning in infants.</ti>
        <ui>20265054</ui>
        <yr>1947</yr>
    </citation>
</results>  

I found a small piece of code here that lets me rename a node:

    Public Shared Function RenameNode(ByVal e As XmlNode, newName As String) As XmlNode
        Dim doc As XmlDocument = e.OwnerDocument
        Dim newNode As XmlNode = doc.CreateNode(e.NodeType, newName, Nothing)
        While (e.HasChildNodes)
            newNode.AppendChild(e.FirstChild)
        End While
        Dim ac As XmlAttributeCollection = e.Attributes
        While (ac.Count > 0) 
            newNode.Attributes.Append(ac(0))
        End While
        Dim parent As XmlNode = e.ParentNode
        parent.ReplaceChild(newNode, e)
        Return newNode
    End Function

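But a problem arises when iterating over the XmlAttributeCollection. For some reason, when looking at one of the td nodes, two attributes that do not appear in the source magically show up: rowspan and colspan. These attributes seem to be breaking the loop, because when they are consumed they do not disappear from the attribute list the way the 'class' attribute does. Instead, only the attribute's value is consumed (it changes from "1" to ""), so ac.Count never reaches zero. This causes an infinite loop.

I noticed that they are of type 'XmlUnspecifiedAttribute', but when I modified the loop to detect that: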

While (ac.Count > 0) And Not TypeOf (ac(0)) Is System.Xml.XmlUnspecifiedAttribute
    newNode.Attributes.Append(ac(0))
End While

I get the following error:

System.Xml.XmlUnspecifiedAttribute is not accessible in this context because it is 'friend'

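Why does this happen, and how can I work around it?

1 Answer:

Answer 0: (score: 2)

I think the problem you are running into is really your document type declaration.

Since you are translating the nodes into something else entirely, I would say you don't even need it and can safely ignore it.

I did not include it in my first test, and when I did include it the XmlResolver got confused, so I assume you definitely don't need it.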

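You can ignore it by setting the resolver to Nothing:

{xml document object}.XmlResolver = Nothing

Then you select your nodes and process them. I even kept the doctype in the source file and still had no problems.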
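To make the effect of the resolver setting visible, here is a minimal sketch that loads a small XHTML string with the resolver disabled and lists each attribute on the first td together with its Specified flag. The module name, the LoadWithoutDtd helper, and the inline sample are hypothetical and only for illustration; XmlDocument.XmlResolver and XmlAttribute.Specified are standard System.Xml members.

```vbnet
Imports System
Imports System.Xml

Module ResolverSketch
    ' Hypothetical helper: parses XHTML text without fetching the external DTD
    ' named in its DOCTYPE. The resolver must be cleared before loading.
    Function LoadWithoutDtd(ByVal xhtml As String) As XmlDocument
        Dim doc As New XmlDocument()
        doc.XmlResolver = Nothing
        doc.LoadXml(xhtml)
        Return doc
    End Function

    Sub Main()
        ' Small stand-in for the service response (assumption, not the real data).
        Dim sample As String =
            "<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Strict//EN"" " &
            """http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"">" &
            "<html xmlns=""http://www.w3.org/1999/xhtml""><body><table><tr>" &
            "<td class=""so"">Journal</td></tr></table></body></html>"

        Dim doc As XmlDocument = LoadWithoutDtd(sample)
        Dim td As XmlElement = CType(doc.GetElementsByTagName("td")(0), XmlElement)

        ' Attributes written in the source report Specified = True;
        ' attributes defaulted from a DTD report Specified = False.
        For Each a As XmlAttribute In td.Attributes
            Console.WriteLine("{0} (Specified={1})", a.Name, a.Specified)
        Next
    End Sub
End Module
```

Because Specified is a public property, it can stand in for the TypeOf check against the Friend class XmlUnspecifiedAttribute that the compiler rejected in the question.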

Here is the code I used to test:

' Requires Imports System.Xml at the top of the file.
Private Sub Form1_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
    Dim USEDoc As New XmlDocument

    ' The source elements are in the XHTML namespace, so XPath needs a prefix for them.
    Dim theNameManager As System.Xml.XmlNamespaceManager = New System.Xml.XmlNamespaceManager(USEDoc.NameTable)
    theNameManager.AddNamespace("xhtml", "http://www.w3.org/1999/xhtml")

    ' Clear the resolver before loading so the external DTD is not fetched.
    USEDoc.XmlResolver = Nothing
    USEDoc.Load("RestServ.txt")

    renameNodes(USEDoc.SelectSingleNode("descendant::xhtml:table", theNameManager))

    ' Renamed nodes are created without a namespace, so "//results" needs no prefix.
    Dim SaveDoc As New XmlDocument
    SaveDoc.AppendChild(SaveDoc.ImportNode(USEDoc.SelectSingleNode("//results", theNameManager), True))

    SaveDoc.Save("RestServConv.xml")
End Sub

' Renames TopNode according to its tag or class attribute, then recurses into
' the children of the renamed node. Elements that are neither tr/table nor
' carry a class attribute are left unchanged and are not descended into.
Public Function renameNodes(ByVal TopNode As XmlNode) As Boolean
    Dim UseNode As XmlNode = Nothing

    If TopNode.Name <> "#text" Then
        If TopNode.Name = "tr" Then
            UseNode = RenameNode(TopNode, "citation")
        ElseIf TopNode.Name = "table" Then
            UseNode = RenameNode(TopNode, "results")
            UseNode.Attributes.RemoveNamedItem("id")
        ElseIf TopNode.Attributes.Count > 0 Then
            For Each oAttribute As XmlAttribute In TopNode.Attributes
                If oAttribute.Name = "class" Then
                    ' The class value becomes the new element name.
                    UseNode = RenameNode(TopNode, oAttribute.Value)
                    UseNode.Attributes.RemoveNamedItem("class")
                    Exit For
                End If
            Next oAttribute
        End If

        If UseNode IsNot Nothing Then
            If UseNode.ChildNodes.Count > 0 Then
                ' Each child is replaced in place, so indexes stay valid.
                Dim x As Integer
                For x = 0 To UseNode.ChildNodes.Count - 1
                    renameNodes(UseNode.ChildNodes(x))
                Next x
            End If
        End If
    End If

    Return True
End Function

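' Replaces e with a new node named newName (no namespace), moving the
' children and attributes of e onto it.
Public Shared Function RenameNode(ByVal e As XmlNode, ByVal newName As String) As XmlNode
    Dim doc As XmlDocument = e.OwnerDocument
    Dim newNode As XmlNode = doc.CreateNode(e.NodeType, newName, Nothing)
    ' AppendChild moves each child, so FirstChild advances until none remain.
    While (e.HasChildNodes)
        newNode.AppendChild(e.FirstChild)
    End While
    ' Append moves each attribute out of ac. This loop only terminates when
    ' no DTD-defaulted attributes are present (see the question).
    Dim ac As XmlAttributeCollection = e.Attributes
    While (ac.Count > 0)
        newNode.Attributes.Append(ac(0))
    End While
    Dim parent As XmlNode = e.ParentNode
    parent.ReplaceChild(newNode, e)
    Return newNode
End Function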
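If the DTD has to stay in play, one alternative is to make the attribute copy itself safe rather than rely on the resolver setting. The sketch below is a hypothetical variant of RenameNode (the name RenameElementSkippingDefaults and the module are not from the original code). It assumes the node is an XmlElement, clones only attributes whose Specified property is True, and never removes anything from the source collection, so the loop is bounded by a plain For Each.

```vbnet
Imports System.Xml

Public Module RenameHelpers
    ' Hypothetical variant of RenameNode: copies only attributes that were
    ' explicitly written in the source and skips DTD-defaulted ones.
    Public Function RenameElementSkippingDefaults(ByVal e As XmlElement,
                                                  ByVal newName As String) As XmlElement
        Dim doc As XmlDocument = e.OwnerDocument
        ' Like the original, the new element is created without a namespace.
        Dim newNode As XmlElement = doc.CreateElement(newName)

        ' AppendChild moves each child out of e, so this loop terminates.
        While e.HasChildNodes
            newNode.AppendChild(e.FirstChild)
        End While

        ' The source collection is only read here, never modified, so
        ' defaulted attributes cannot keep the loop alive.
        For Each a As XmlAttribute In e.Attributes
            If a.Specified Then
                newNode.SetAttributeNode(CType(a.CloneNode(True), XmlAttribute))
            End If
        Next

        e.ParentNode.ReplaceChild(newNode, e)
        Return newNode
    End Function
End Module
```

In renameNodes this could be called in place of RenameNode after casting TopNode with CType(TopNode, XmlElement). That cast is an assumption about the input: it holds for the tr, td, a and span elements in the sample, but not for text or comment nodes.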

I passed in your sample document, and the result I got was:

<results>
  <citation>
    <_index>
      <uri href="REDACTED">1</uri>
    </_index>
    <au>
      <span xmlns="http://www.w3.org/1999/xhtml">GILLESPIE JB</span>
      <span xmlns="http://www.w3.org/1999/xhtml">KUKES RE</span>
    </au>
    <so rowspan="1" colspan="1">A.M.A. American Journal of Diseases of Children</so>
    <ti>Acetylsalicylic acid poisoning with recovery.</ti>
    <ui>20267726</ui>
    <yr>1947</yr>
  </citation>
  <citation>
    <_index>
      <uri href="REDACTED">2</uri>
    </_index>
    <au>BASS MH</au>
    <so>Journal of the Mount Sinai Hospital, New York</so>
    <ti>Aspirin poisoning in infants.</ti>
    <ui>20265054</ui>
    <yr>1947</yr>
  </citation>
</results>
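In this output the span elements keep their XHTML names, because renameNodes only renames elements that are tr, table, or carry a class attribute. The question asked for item as the fallback name. A minimal sketch of that naming rule follows; PickElementName and the module are hypothetical names, and the table and tr mappings are taken from the test code above.

```vbnet
Imports System.Xml

Module NamingSketch
    ' Hypothetical helper: chooses the target element name for one source element.
    Function PickElementName(ByVal el As XmlElement) As String
        ' Structural elements get fixed names, as in renameNodes.
        Select Case el.LocalName
            Case "table"
                Return "results"
            Case "tr"
                Return "citation"
        End Select

        ' GetAttribute returns an empty string when the attribute is absent.
        Dim cls As String = el.GetAttribute("class")
        If cls.Length > 0 Then
            Return cls
        End If

        ' No class attribute: fall back to the generic name from the question.
        Return "item"
    End Function
End Module
```

This assumes every class value is a single valid XML name. A value containing spaces, such as a multi-class attribute, would need to be split or sanitized before being passed to CreateElement.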