I am trying to parse each div with class="base shortstory" in the following markup:
<div id="dle-content">
    <div class="base shortstory">
        <h3 class="btl"><a href="http://someurl.com/htc-jetstream.html">HTC Jetstream</a></h3>
    </div>
    <div class="base shortstory">
        <h3 class="btl"><a href="http://someurl.com/samsung.html">Samsung S4</a></h3>
    </div>
    <div class="base shortstory">
        <h3 class="btl"><a href="http://someurl.com/dell.html">Dell Streak</a></h3>
    </div>
</div>
Here is the code:
const string url = "http://someurl.com/catalogue";
const string rootUrl = "http://someurl.com";
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(url);
int dealsCount = 0;
HtmlNode root = doc.DocumentNode.SelectSingleNode("//div[@id='dle-content']");
int i = 1;
//this is for the default page
while (i<=10)
{
    try
    {
        string node= String.Format("//div[{0}]", i);
        var link = doc.DocumentNode.SelectSingleNode(node);
        var href = link.SelectSingleNode("//div[@class='mlink']//span[@class='argmore']//a[@href]").Attributes["href"].Value;
        string title = link.SelectSingleNode("//h3[@class='btl']//a[@href]").InnerText.Trim();
        string description = link.SelectSingleNode("//div[@class='maincont']//div[1]").InnerText.Replace("\n", " ").Replace("\r", "").Replace("\t", "").Trim();
        description = RemoveHTMLComments(description);
        var imageURL = link.SelectSingleNode("//div[@class='maincont']//div[1]//a//img").Attributes["src"].Value;
        var price = link.SelectSingleNode("//div[@class='mlink']//span[3]//font").InnerText.Trim();
        price = Regex.Match(price, @"\d+").Value;
        var partnerdealID = href;
        //no information
        var isActivesStr = link.SelectSingleNode("//div[@class='mlink']//span[2]/font").InnerText.Trim();
        bool isActive;
        if (isActivesStr.Contains("Нет в наличии"))
        {
            isActive = false;
        }
        else
        {
            isActive = true;
        }
        var dealUrl = href; //requires login - show the page itself
    }
    catch (Exception)
    {
    }
    i += 1;
}
But as the loop runs, the selected node is always the first node. What am I doing wrong?
Answer 0 (score: 2)
All of your XPath expressions start with '//', which means "start at the root of the document and search recursively". So when you do this:
link.SelectSingleNode("//div[@class='mlink']//span[@class='argmore']//a[@href]")
you are not starting from link, but from the root of the document. You probably want to do this instead:
link.SelectSingleNode("div[@class='mlink']...etc...")
which is equivalent to
link.SelectSingleNode("./div[@class='mlink']...etc...")
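'.' means the current node. '/' means search only the direct children, not recursively. If the element you need is nested deeper inside link, use './/' to search recursively but still starting from the current node.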
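
To make the difference concrete, here is a minimal sketch of the loop using node-relative XPath. It assumes the markup shown in the question (loaded from a string so it runs without network access) and extracts only the title and link; the other fields from the question (description, price, image) are left out because their markup is not shown. It also swaps the `//div[{0}]` index loop for a single `SelectNodes` call, which is my substitution, not something the answer above prescribes.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class Program
{
    static void Main()
    {
        // Sample markup copied from the question.
        const string html = @"<div id=""dle-content"">
<div class=""base shortstory"">
<h3 class=""btl""><a href=""http://someurl.com/htc-jetstream.html"">HTC Jetstream</a></h3>
</div>
<div class=""base shortstory"">
<h3 class=""btl""><a href=""http://someurl.com/samsung.html"">Samsung S4</a></h3>
</div>
<div class=""base shortstory"">
<h3 class=""btl""><a href=""http://someurl.com/dell.html"">Dell Streak</a></h3>
</div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // An absolute path is fine here: we do want to search the whole document.
        HtmlNode root = doc.DocumentNode.SelectSingleNode("//div[@id='dle-content']");
        if (root == null) return;

        // "./" restricts the search to direct children of root.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = root.SelectNodes("./div[@class='base shortstory']");
        if (items == null) return;

        foreach (HtmlNode item in items)
        {
            // ".//" searches recursively, but starting from item, not from the document root.
            HtmlNode anchor = item.SelectSingleNode(".//h3[@class='btl']/a[@href]");
            if (anchor == null) continue;

            string title = anchor.InnerText.Trim();
            string href = anchor.GetAttributeValue("href", "");
            Console.WriteLine(title + " -> " + href);
        }
    }
}
```

Checking for null instead of wrapping everything in an empty catch block also makes it easier to see which expression failed to match.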