Unable to select an element with multiple attributes using XPath

Time: 2017-04-30 15:40:06

Tags: c# html xpath html-agility-pack

I am trying to parse news.google. The page contains anchor elements like this one:



<a target="_blank"class="article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link" href="http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/" url="http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/" id="MAA4AEgAUABgAWoCY2E"  ssid="h" >

I want the value of the url attribute, but I cannot get it. All I get is a null reference.

The XPath I use to find this multi-attribute element:

HtmlNode aNodes = doc.DocumentNode.SelectSingleNode("//a[@target='_blank' and @class='article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link' and @href='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' and @url='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' and @id='MAA4AEgAUABgAWoCY2E' and @ssid='h']");

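Just trying to find this element gives me a null reference. Attribute values such as url and href change all the time. Is there a way to get the url based on which attributes the element has, rather than on their values? In other words, if an element has these five attributes, select the value of url? Thank you very much.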

1 answer:

Answer 0 (score: 1)

Yes, you can select elements by the presence of attributes rather than by specific attribute values.

Test HTML:

var html = @"
<!-- match -->
<a target='_blank'class='article usg-AFQjCNFr5aujpYnTzdHNYfHZw_gNN6iq-w sig2-1esugE2Sy8Bhe2CzulGmsA did--5114870031117960448 esc-thumbnail-link' href='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' url='http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/' id='MAA4AEgAUABgAWoCY2E'  ssid='h' ></a>
<!-- NO match, missing url -->
<a target='_blank' href='NO MATCH' ssid='' id='' class=''></a>
<!-- match -->
<a target='_blank' href='#' ssid='' id='' class='' url='MATCH'></a>
<!-- NO match, missing multiple wanted attributes -->
<a target='_blank' href='#' url='NO MATCH'></a>
";

And a little LINQ:

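// Requires the HtmlAgilityPack package, plus: using System; using System.Linq; using HtmlAgilityPack;
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
// Keep only the <a> elements that have all six attributes, whatever their values.
// SelectNodes returns null when nothing matches, so check for that on real pages.
var wantedLinks = from a in document.DocumentNode.SelectNodes("//a")
    where a.Attributes["url"] != null
    && a.Attributes["ssid"] != null
    && a.Attributes["href"] != null
    && a.Attributes["id"] != null
    && a.Attributes["class"] != null
    && a.Attributes["target"] != null
    select a;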

foreach (var a in wantedLinks)
{
    Console.WriteLine(a.Attributes["url"].Value);
}

Output (note that links without all six attributes are skipped):

http://www.theglobeandmail.com/news/world/trump-blasts-media-in-rally-celebrating-100-days-as-president/article34858356/
MATCH
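
The presence check can probably also be expressed in the XPath itself, since a predicate such as `[@url]` tests only that the attribute exists and ignores its value. Below is a minimal sketch under that assumption. The class `UrlAttributeDemo` and the helper `ExtractUrls` are hypothetical names made up for illustration and are not part of HtmlAgilityPack; the sample HTML is a shortened version of the test HTML above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class UrlAttributeDemo
{
    // [@name] is true when the attribute exists, whatever its value,
    // so this matches <a> elements that carry all six attributes.
    private const string AllSixAttributes =
        "//a[@target and @class and @href and @id and @ssid and @url]";

    // Hypothetical helper: returns the url value of every matching link.
    public static List<string> ExtractUrls(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes(AllSixAttributes);
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes
            .Select(a => a.GetAttributeValue("url", string.Empty))
            .ToList();
    }

    public static void Main()
    {
        var html = @"
<a target='_blank' href='#' ssid='' id='' class='' url='MATCH'></a>
<a target='_blank' href='#' url='NO MATCH'></a>";

        foreach (var url in ExtractUrls(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Compared with the LINQ filter, this keeps the whole condition in one string. The null check matters on real pages, where the markup may not contain any matching link at all.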