Given this HTML code:
<div class="item">
<div class="thumb">
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
<img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
</div>
<div class="release">
<h3>Wolf Eyes</h3>
<h4>
<a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Wolf Eyes - Lower Demos">Lower Demos</a>
</h4>
<script src="/ads/button.js"></script>
</div>
<div class="release-year">
<p>Year</p>
<span>2013</span>
</div>
<div class="genre">
<p>Genre</p>
<a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
<a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
</div>
</div>
I know how to parse it in other ways, but I would like to use the HtmlAgilityPack library to retrieve this information:
Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year  : 2013
Genres: Rock, Pop
URL   : http://www.mp3crank.com/wolf-eyes/lower-demos-121866
These values come from the following parts of the HTML:
Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year : <span>2013</span>
Genre1: <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
Genre2: <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
URL : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866"
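This is what I am trying, but I always get an "object reference not set" exception when I try to select a single node.
Sorry, I am new to HTML. I tried to follow the steps from this question: HtmlAgilityPack basic how to get title and link?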
Public Class Form1
Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing
Private Title As String = String.Empty
Private Cover As String = String.Empty
Private Genres As String() = {String.Empty}
Private Year As Integer = -0
Private URL as String = String.Empty
Private Sub Test() Handles MyBase.Shown
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
Genres = ¿select multiple nodes?
URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value
Next
End Sub
End Class
Answer 0 (score: 2)
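Your mistake here is trying to read an attribute of a child node from the parent node you found.
When you call node.SelectSingleNode("//div[@class='release']") you get back the correct div, but calling .Attributes only returns the attributes of the div tag itself, not those of any inner HTML elements. That div has no title attribute, so Attributes("title") returns Nothing, and reading .Value from it raises the "object reference not set" exception.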
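It is possible to write XPath queries that select child nodes, for example //div[@class='release']/h4/a. For more information on XPath, see http://www.w3schools.com/xpath/xpath_syntax.asp. Although the examples there are for XML, most of the principles apply to HTML documents as well.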
Another approach is to make further XPath calls on the node you have already found. I have modified your code so that it works this way:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
' Assumes we find the node and that it contains an a-tag
Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
Dim genreLinks = genreNode.SelectNodes(".//a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Next
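Note that this code assumes the document is well formed and that every node, element, and attribute exists and is correct. You will probably want to add a good deal of error checking, e.g. If someNode Is Nothing Then ....
Edit: I slightly modified the code above so that every .SelectSingleNode call uses the ".//" prefix. This makes it work when there are several "item" nodes; otherwise the query selects the first match in the whole document rather than the first match under the current node.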
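To make the difference between "//" and ".//" concrete, here is a minimal sketch. The two-item HTML string and the module name are made up for illustration; only the HtmlAgilityPack calls already used in this answer are assumed.

```vb
Imports System
Imports HtmlAgilityPack

Module RelativeXPathDemo
    Sub Main()
        ' Hypothetical two-item document, invented for this illustration.
        Dim html As String =
            "<div class='item'><div class='release-year'><span>2013</span></div></div>" &
            "<div class='item'><div class='release-year'><span>2012</span></div></div>"

        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        For Each item As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='item']")
            ' "//" is evaluated from the document root, even when called on item.
            Dim fromRoot As HtmlNode = item.SelectSingleNode("//div[@class='release-year']/span")
            ' ".//" is evaluated relative to the current item.
            Dim fromItem As HtmlNode = item.SelectSingleNode(".//div[@class='release-year']/span")
            Console.WriteLine("//  -> {0}", fromRoot.InnerText)
            Console.WriteLine(".// -> {0}", fromItem.InnerText)
        Next
    End Sub
End Module
```

For the second item, the "//" query should still report the first item's year, while the ".//" query reports the year of the item being processed.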
If you want a shorter XPath solution, here is the same code using that approach:
' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
Genres = (From n In genreLinks Select n.InnerText).ToArray()
Console.WriteLine("Title : {0}", Title)
Console.WriteLine("Cover : {0}", Cover)
Console.WriteLine("Year : {0}", Year)
Console.WriteLine("Genres: {0}", String.Join(",", Genres))
Console.WriteLine("URL : {0}", URL)
Console.WriteLine()
Next
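The error checking suggested earlier could look roughly like the sketch below. TextOrDefault is a hypothetical helper, not part of HtmlAgilityPack, and the sample HTML is a trimmed-down stand-in for the real page. The sketch also assumes that SelectNodes returns Nothing when nothing matches, which is how the library has traditionally behaved.

```vb
Imports System
Imports System.Linq
Imports HtmlAgilityPack

Module SafeParsingDemo
    ' Hypothetical helper: trimmed inner text of the first match, or a fallback.
    Function TextOrDefault(parent As HtmlNode, xpath As String, fallback As String) As String
        If parent Is Nothing Then Return fallback
        Dim found As HtmlNode = parent.SelectSingleNode(xpath)
        If found Is Nothing Then Return fallback
        Return found.InnerText.Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        ' Reduced sample: this item has a year but no genre block.
        doc.LoadHtml("<div class='item'><div class='release-year'><p>Year</p><span>2013</span></div></div>")

        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")
        If items Is Nothing Then
            Console.WriteLine("No items found.")
            Return
        End If

        For Each item As HtmlNode In items
            ' TryParse leaves releaseYear at 0 when the text is missing or not a number.
            Dim releaseYear As Integer
            Integer.TryParse(TextOrDefault(item, ".//div[@class='release-year']/span", ""), releaseYear)

            ' Guard against a missing genre block before using LINQ on the result.
            Dim genreLinks As HtmlNodeCollection = item.SelectNodes(".//div[@class='genre']/a")
            Dim genres As String() = If(genreLinks Is Nothing,
                                        New String() {},
                                        genreLinks.Select(Function(n) n.InnerText.Trim()).ToArray())

            Console.WriteLine("Year  : {0}", releaseYear)
            Console.WriteLine("Genres: {0}", String.Join(", ", genres))
        Next
    End Sub
End Module
```

Wrapping the lookup in one helper keeps the loop body readable; whether a silent fallback or an explicit error is the better choice depends on how much you trust the page layout.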
Answer 1 (score: 1)
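You are not far from the solution. Two important notes:
- // is a recursive call. It can have a serious performance impact, and it may select nodes you do not want, so I recommend using it only when the hierarchy is deep, complex, or variable and you do not want to spell out the whole path.
- There is a useful helper method on HtmlNode called GetAttributeValue that gives you an attribute's value even if the attribute does not exist (you need to specify a default value).
Here is an example that seems to work: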
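The following is only a minimal sketch of how GetAttributeValue could be used, written for illustration and not the answerer's original sample. The module name, the reduced HTML string, and the "(no title)" default are assumptions.

```vb
Imports System
Imports HtmlAgilityPack

Module GetAttributeValueDemo
    Sub Main()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        ' Reduced sample: the link has an href but deliberately no title attribute.
        doc.LoadHtml("<div class='item'><div class='release'><h4>" &
                     "<a href='http://www.mp3crank.com/wolf-eyes/lower-demos-121866'>Lower Demos</a>" &
                     "</h4></div></div>")

        ' Explicit child steps after the initial "//" instead of a recursive search at every level.
        Dim link As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='item']/div[@class='release']/h4/a")

        If link IsNot Nothing Then
            ' GetAttributeValue returns the supplied default when the attribute is absent,
            ' so there is no Nothing to dereference.
            Dim url As String = link.GetAttributeValue("href", String.Empty)
            Dim title As String = link.GetAttributeValue("title", "(no title)")
            Console.WriteLine("URL   : {0}", url)
            Console.WriteLine("Title : {0}", title)
        End If
    End Sub
End Module
```

The node itself can still be Nothing when the XPath matches nothing, so the If link IsNot Nothing check is still needed; GetAttributeValue only protects against a missing attribute.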