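I am using a function that collects every available XPath expression from an HTML document with the HtmlAgilityPack library.
The problem is that I get the expressions in this format: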
/html[1]/body[1]/div[1]/div[1]/div[1]/div[1]/h4[1]/a[1]
I would like to improve it so that the nodes/elements are identified by name, like this:
/html/body/div[@class='infolinks']/div[@class='music']/div[@class='item']/div[@class='release']/h4[1]/a[@title]
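But I don't know how to get those names properly with HtmlAgilityPack.
How can I do that?
Note: I am no XPath expert, so apologies if the XPath syntax is wrong or I have misunderstood something.
The source of the web page I am testing with: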
<div class="infolinks"><input type="hidden" name="IL_IN_TAG" value="1"/></div><div id="main">
<div class="music">
<h2 class="boxtitle">New releases \ <small>
<a href="/newalbums" title="New releases mp3 downloads" rel="bookmark">see all</a></small>
</h2>
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" rel="bookmark" lang="en" title="Curt Smith - Deceptively Heavy album downloads"><img width="100" height="100" alt="Mp3 downloads Curt Smith - Deceptively Heavy" title="Free mp3 downloads Curt Smith - Deceptively Heavy" src="http://www.mp3crank.com/cover-album/Curt-Smith-Deceptively-Heavy-400x400.jpg"/></a>
</div>
<div class="release">
<h3>Curt Smith</h3>
<h4>
<a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" title="Mp3 downloads Curt Smith - Deceptively Heavy">Deceptively Heavy</a>
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
<a href="http://www.mp3crank.com/genre/indie" rel="tag">Indie</a><a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
</div>
</div>
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads"><img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Mp3 downloads Wolf Eyes - Lower Demos">Lower Demos</a>
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
<a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
</div>
</div>
</div>
</div>
The function that gets the XPaths:
Public Function GetXPaths(ByVal Document As HtmlAgilityPack.HtmlDocument) As List(Of String)
    Dim XPathList As New List(Of String)
    Dim XPath As String = String.Empty
    For Each Child As HtmlAgilityPack.HtmlNode In Document.DocumentNode.ChildNodes
        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If
    Next ' Child
    Return XPathList
End Function
Private Sub GetXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                      ByRef XPathList As List(Of String),
                      Optional ByVal XPath As String = Nothing)
    XPath = Node.XPath
    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If
    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes
        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If
    Next ' Child
End Sub
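These are the XPaths I use to retrieve certain values. I would like the function above to produce more or less the same XPaths, in fully qualified form.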
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").GetAttributeValue("title", "Unknown Title")
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").GetAttributeValue("src", String.Empty)
Year = node.SelectSingleNode(".//div[@class='release-year']/span").InnerText
Genres = (From genre In node.SelectNodes(".//div[@class='genre']/a") Select genre.InnerText).ToArray
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").GetAttributeValue("href", "Unknown URL")
Answer 0 (score: 1)
If the corresponding element has a class attribute, this appends a class attribute filter to the XPath:
Private Sub GetHtmlXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                          ByRef XPathList As List(Of String),
                          Optional ByVal XPath As String = Nothing)
    ' Append only this node's own step (e.g. "/div[1]") to the path built so far.
    XPath &= Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))
    Const ClassNameFilter As String = "[@class='{0}']"
    Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)
    ' Add the class filter only when the element actually has a class.
    If Not String.IsNullOrEmpty(ClassName) Then
        XPath &= String.Format(ClassNameFilter, ClassName)
    End If
    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If
    ' Recurse into child elements, passing the accumulated path down.
    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes
        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetHtmlXPaths(Child, XPathList, XPath)
        End If
    Next Child
End Sub
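The answer only shows the recursive Sub, so here is a minimal sketch of how it might be called. The public wrapper, the module name XPathDemo and the sample HTML string are assumptions made for illustration; they are not part of the original answer. Because each step is taken from Node.XPath, the positional index is kept and the class filter follows it, giving steps of the form div[1][@class='music'].

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module XPathDemo ' hypothetical module name

    ' Hypothetical wrapper (not in the original answer): starts the recursion
    ' at every top-level element of the document.
    Public Function GetHtmlXPaths(ByVal Document As HtmlDocument) As List(Of String)
        Dim XPathList As New List(Of String)
        For Each Child As HtmlNode In Document.DocumentNode.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                GetHtmlXPaths(Child, XPathList, String.Empty)
            End If
        Next Child
        Return XPathList
    End Function

    ' Same logic as the Sub from the answer above.
    Private Sub GetHtmlXPaths(ByVal Node As HtmlNode,
                              ByRef XPathList As List(Of String),
                              Optional ByVal XPath As String = Nothing)
        ' Append this node's own step, then a class filter if it has a class.
        XPath &= Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))
        Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)
        If Not String.IsNullOrEmpty(ClassName) Then
            XPath &= String.Format("[@class='{0}']", ClassName)
        End If
        If Not XPathList.Contains(XPath) Then
            XPathList.Add(XPath)
        End If
        For Each Child As HtmlNode In Node.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                GetHtmlXPaths(Child, XPathList, XPath)
            End If
        Next Child
    End Sub

    Sub Main()
        ' Made-up sample markup, loosely modelled on the page in the question.
        Dim Doc As New HtmlDocument()
        Doc.LoadHtml("<html><body><div class='music'><div class='release'>" &
                     "<h4><a title='t'>x</a></h4></div></div></body></html>")
        For Each Path As String In GetHtmlXPaths(Doc)
            Console.WriteLine(Path)
        Next Path
    End Sub

End Module
```

Note that a class value containing an apostrophe would produce an invalid filter; this sketch does not handle that case.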