How to remove <a> elements from XPath?

Time: 2016-04-27 11:33:57

Tags: c# html xpath screen-scraping html-agility-pack

I'm making an application in C# with HTMLAgilityPack.

I have the following HTML structure:

<td colspan="3">
    <a href="tournament_detail.asp?EID=3">The North West Junior Champions League 2016</a>
    <br>
    St Bedes Sports Fields,  Manchester. M21 0TT
</td>

I would like to pull out the address, excluding the <a> and the <br /> elements.

I have tried the following:

//div[@class='infobox']/table/tr/td[1][not a]

Here is the site I am trying to pull data from

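I'm using HTMLAgilityPack, so I don't believe I can use the string() function (or at least I get an exception when I try). Please don't mark this as a duplicate, as I'm seeking clarification on whether I can use it.

How can I pull out the address?

1 answer:

Answer 0: (score: 2)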

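Adding the predicate [not(a)] makes the XPath return only <td> elements that have no child <a>, which is not the desired result. Instead, append /text()[normalize-space()] to return the non-empty text nodes that are direct children of the selected <td>: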

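using System;
using HtmlAgilityPack;

var raw = @"<td colspan='3'>
    <a href='tournament_detail.asp?EID=3'>The North West Junior Champions League 2016</a>
    <br>
    St Bedes Sports Fields,  Manchester. M21 0TT</td>";
var doc = new HtmlDocument();
doc.LoadHtml(raw);
// Select the first non-whitespace text node that is a direct child of <td>
var textNode = doc.DocumentNode.SelectSingleNode("//td/text()[normalize-space()]");
// Prints: St Bedes Sports Fields,  Manchester. M21 0TT
Console.WriteLine(textNode.InnerText.Trim());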
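
On a real page the cell may hold more than one non-empty text node, or the XPath may match nothing at all. The following is a minimal sketch of a more defensive variant, assuming the same markup as in the question. The class name AddressExtractor and the choice to join the fragments with a single space are illustrative assumptions, not part of the original answer. It also checks what the [not(a)] predicate selects on this input.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class AddressExtractor // hypothetical name, for illustration only
{
    static void Main()
    {
        // Same markup as in the question.
        var raw = @"<td colspan='3'>
    <a href='tournament_detail.asp?EID=3'>The North West Junior Champions League 2016</a>
    <br>
    St Bedes Sports Fields,  Manchester. M21 0TT</td>";

        var doc = new HtmlDocument();
        doc.LoadHtml(raw);

        // [not(a)] keeps only <td> elements without an <a> child,
        // so the single <td> in this snippet is filtered out.
        var filtered = doc.DocumentNode.SelectSingleNode("//td[not(a)]");
        Console.WriteLine(filtered == null ? "[not(a)]: no match" : "[not(a)]: match");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//td/text()[normalize-space()]");
        if (textNodes == null)
        {
            Console.WriteLine("address not found");
            return;
        }

        // Join every non-empty direct text child, decoding entities such as &amp;.
        // Joining with a space is an assumption about the desired output format.
        var address = string.Join(" ",
            textNodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
        Console.WriteLine(address);
    }
}
```

The null check matters because both SelectSingleNode and SelectNodes return null when the expression matches nothing, so calling InnerText directly on the result would throw a NullReferenceException on pages that lack the expected cell.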