Get the available XPaths of an HTML page?

Asked: 2014-08-19 02:44:42

Tags: html .net xml vb.net xpath

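I have taken and modified this code to retrieve the XPath expressions of an XML document.

I would like to do the same thing for an HTML page and retrieve its available XPaths (perhaps from an HtmlDocument?). Is this possible?

Note: I am fine with either a native solution or the HtmlAgilityPack library.

This is the XML approach: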

''' <summary>
''' Gets all the XPath expressions of an XML Document.
''' </summary>
''' <param name="Document">Indicates the XML document.</param>
''' <returns>List(Of System.String).</returns>
Public Function GetXPaths(ByVal Document As Xml.XmlDocument) As List(Of String)

    Dim XPathList As New List(Of String)

    Dim XPath As String = String.Empty

    For Each Child As Xml.XmlNode In Document.ChildNodes

        If Child.NodeType = Xml.XmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If

    Next ' child

    Return XPathList

End Function

''' <summary>
''' Gets all the XPath expressions of an XML Node.
''' </summary>
''' <param name="Node">Indicates the XML node.</param>
''' <param name="XPathList">The <see cref="List(Of String)"/> that collects the XPath expressions found so far.</param>
''' <param name="XPath">Indicates the current XPath.</param>
Private Sub GetXPaths(ByVal Node As Xml.XmlNode,
                      ByRef XPathList As List(Of String),
                      Optional ByVal XPath As String = Nothing)

    XPath &= "/" & Node.Name

    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If

    For Each Child As Xml.XmlNode In Node.ChildNodes

        If Child.NodeType = Xml.XmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If

    Next ' child

End Sub

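1 answer:

Answer 0 (score: 1)

As far as I know, HtmlAgilityPack's class structure is very similar to that of XmlDocument. So I believe you can easily adapt your current solution to work with HtmlDocument, like this: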

Public Function GetXPaths(ByVal Document As HtmlDocument) As List(Of String)
    Dim XPathList As New List(Of String)
    Dim XPath As String = String.Empty
    For Each Child As HtmlNode In Document.DocumentNode.ChildNodes
        If Child.NodeType = HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If
    Next ' child
    Return XPathList
End Function

Private Sub GetXPaths(ByVal Node As HtmlNode,
                  ByRef XPathList As List(Of String),
                  Optional ByVal XPath As String = Nothing)
    XPath &= "/" & Node.Name
    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If
    For Each Child As HtmlNode In Node.ChildNodes
        If Child.NodeType = HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If
    Next ' child
End Sub

This worked fine when I tested it with XML-compliant HTML, but I can't guarantee how well it will work on malformed HTML documents.
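
As a possible alternative to building the paths by hand, HtmlAgilityPack nodes expose an `XPath` property that returns a positional path for each node. The sketch below is not part of the original answer and is untested against malformed markup. It assumes the HtmlAgilityPack package is referenced; the module name `XPathHelper` and the function name `GetIndexedXPaths` are made up for illustration. Unlike the recursive version above, it returns one indexed path per element instead of de-duplicated generic paths.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Public Module XPathHelper

    ''' <summary>
    ''' Collects the positional XPath of every element node in an HTML document.
    ''' (Hypothetical helper; relies on HtmlNode.XPath from HtmlAgilityPack.)
    ''' </summary>
    Public Function GetIndexedXPaths(ByVal Document As HtmlDocument) As List(Of String)

        ' Descendants() walks the whole tree; text and comment nodes are filtered out.
        Return Document.DocumentNode.Descendants().
            Where(Function(node) node.NodeType = HtmlNodeType.Element).
            Select(Function(node) node.XPath).
            ToList()

    End Function

    Public Sub Main()

        Dim Document As New HtmlDocument()
        Document.LoadHtml("<html><body><p>first</p><p>second</p></body></html>")

        ' Print each element's path, one per line.
        For Each Path As String In GetIndexedXPaths(Document)
            Console.WriteLine(Path)
        Next Path

    End Sub

End Module
```

Because the parser decides how to repair broken markup, the paths returned for a malformed page describe the tree HtmlAgilityPack built, which may differ from what a browser shows.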