Trying to parse news.google
<a target="_blank"class="article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link" href="http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/" url="http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/" id="MAA4AEgAUABgAWoCY2E" ssid="h" >
I want the url attribute, but I can't get it. Everything I try returns a null reference.
The XPath I use to find this multi-attribute element:
HtmlNode aNodes = doc.DocumentNode.SelectSingleNode("//a[@target='_blank' and @class='article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link' and @href='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' and @url='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' and @id='MAA4AEgAUABgAWoCY2E' and @ssid='h']");
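Just trying to find this element gives me a null reference. Attribute values such as url and href change all the time. Is there a way to get the url based on which attributes the element has, rather than on their values? That is, if an element has these five attributes, select the value of its url? Thank you very much.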
Answer 0 (score: 1)
Yes, you can select elements by the presence of attributes rather than by specific attribute values.
Test HTML:
var html = @"
<!-- match -->
<a target='_blank'class='article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link' href='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' url='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' id='MAA4AEgAUABgAWoCY2E' ssid='h' ></a>
<!-- NO match, missing url -->
<a target='_blank' href='NO MATCH' ssid='' id='' class=''></a>
<!-- match -->
<a target='_blank' href='#' ssid='' id='' class='' url='MATCH'></a>
<!-- NO match, missing multiple wanted attributes -->
<a target='_blank' href='#' url='NO MATCH'></a>
";
And a little LINQ:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var wantedLinks = from a in document.DocumentNode.SelectNodes("//a")
                  where a.Attributes["url"] != null
                      && a.Attributes["ssid"] != null
                      && a.Attributes["href"] != null
                      && a.Attributes["id"] != null
                      && a.Attributes["class"] != null
                      && a.Attributes["target"] != null
                  select a;
foreach (var a in wantedLinks)
{
    Console.WriteLine(a.Attributes["url"].Value);
}
Output (notice that links which do not have all six attributes are skipped):
http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/
MATCH
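Since the question started from an XPath expression, the same presence-only filter can probably also be expressed in XPath itself: a bare `@name` predicate tests whether the attribute exists, regardless of its value. The following is a minimal sketch rather than a tested drop-in; it assumes HtmlAgilityPack is referenced, and the shortened test HTML and the `Program` class are illustrative only.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened, hypothetical test HTML: the first link has all six attributes,
        // the second is missing several of them.
        var html = @"
<a target='_blank' href='#' ssid='' id='' class='' url='MATCH'></a>
<a target='_blank' href='#' url='NO MATCH'></a>
";
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Each bare @name is true when that attribute is present, whatever its value.
        var wantedLinks = document.DocumentNode.SelectNodes(
            "//a[@url and @ssid and @href and @id and @class and @target]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (wantedLinks != null)
        {
            foreach (var a in wantedLinks)
            {
                // GetAttributeValue falls back to the default instead of throwing.
                Console.WriteLine(a.GetAttributeValue("url", string.Empty));
            }
        }
    }
}
```

The null check matters for the original problem too: `SelectNodes` and `SelectSingleNode` return null when no node matches, which is a likely source of the null reference in the question.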