Question

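I have the following HTML structure, which contains a short list of email addresses. I want to scrape only the business email addresses, not the ones from yahoo, gmail, hotmail, etc.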

<a href="#1">some@yahoo.com</a>
<a href="#2">s0m3@ymail.com</a>
<a href="#5">mail@yourbusiness.com</a>
<a href="#3">you@gmail.com</a>
<a href="#6">this@mybusinessmail.co.uk</a>
<a href="#4">me@hotmail.com</a>

So what I want is

mail@yourbusiness.com
this@mybusinessmail.co.uk

My idea is

get A tag which NOT contain ymail AND NOT contain yahoo AND NOT contain gmail, AND NOT contain hotmail

But how do I write the XPath syntax for the idea above?

Answer 1

You can use substring-after and substring-before to get the part after the @ and before the first dot (.), combined with not and contains.

So substring-before(substring-after(text(),"@"),'.') gets the first part of the domain, and //a[not(contains("ymail yahoo gmail hotmail", ...))] filters out the domains you want to exclude.

Altogether:

//a[not(contains("ymail yahoo gmail hotmail", substring-before(substring-after(text(),"@"),'.')))]
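To make the behaviour of this expression easier to check, here is a minimal Python sketch that mirrors its logic using only the standard library. It is an illustration, not part of the original answer: `xml.etree.ElementTree` supports only a limited XPath subset without `contains()`, so the predicate is re-implemented in plain Python, and the names `BLOCKED`, `first_domain_label`, and `keep` are hypothetical. A library with full XPath 1.0 support, such as lxml, should be able to evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, wrapped in a single root element.
HTML = """<html>
 <a href="#1">some@yahoo.com</a>
 <a href="#2">s0m3@ymail.com</a>
 <a href="#5">mail@yourbusiness.com</a>
 <a href="#3">you@gmail.com</a>
 <a href="#6">this@mybusinessmail.co.uk</a>
 <a href="#4">me@hotmail.com</a>
</html>"""

# Same space-separated string used as the first argument of contains().
BLOCKED = "ymail yahoo gmail hotmail"


def first_domain_label(address):
    # Mirrors substring-before(substring-after(text(), "@"), "."):
    # both XPath functions yield "" when the delimiter is missing.
    _, at, after = address.partition("@")
    if not at:
        return ""
    label, dot, _ = after.partition(".")
    return label if dot else ""


def keep(address):
    # Mirrors not(contains(BLOCKED, label)). This is a substring test,
    # so labels such as "mail" or "" are rejected as well.
    return first_domain_label(address) not in BLOCKED


root = ET.fromstring(HTML)
business = [a.text for a in root.iter("a") if keep(a.text or "")]
print(business)
```

One thing the sketch makes visible: because the test is a substring match against the whole list string, an address such as x@mail.example.com would also be dropped, since "mail" occurs inside "ymail". Text without an @ is dropped too, because the empty string is contained in every string.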

Answer 2

Your idea translates directly into XPath, as follows:

//a[not(contains(., 'ymail')) and not(contains(., 'yahoo')) and not(contains(., 'gmail')) and not(contains(., 'hotmail'))]/text()

For your example (with a single root element added),

<html>
 <a href="#1">some@yahoo.com</a>
 <a href="#2">s0m3@ymail.com</a>
 <a href="#5">mail@yourbusiness.com</a>
 <a href="#3">you@gmail.com</a>
 <a href="#6">this@mybusinessmail.co.uk</a>
 <a href="#4">me@hotmail.com</a>
</html>

it selects

mail@yourbusiness.com
this@mybusinessmail.co.uk

as requested.
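The same filter can be sketched in Python with the standard library, as a way to sanity-check the expected result. This is an assumption-laden illustration rather than the answerer's code: `xml.etree.ElementTree` does not implement `contains()`, so the predicate is written as a Python function, and `FREE_PROVIDERS` and `is_business` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Sample document from the answer (single root element added).
DOC = """<html>
 <a href="#1">some@yahoo.com</a>
 <a href="#2">s0m3@ymail.com</a>
 <a href="#5">mail@yourbusiness.com</a>
 <a href="#3">you@gmail.com</a>
 <a href="#6">this@mybusinessmail.co.uk</a>
 <a href="#4">me@hotmail.com</a>
</html>"""

FREE_PROVIDERS = ("ymail", "yahoo", "gmail", "hotmail")


def is_business(text):
    # Mirrors: not(contains(., 'ymail')) and not(contains(., 'yahoo')) and ...
    return not any(provider in text for provider in FREE_PROVIDERS)


root = ET.fromstring(DOC)
selected = []
for a in root.iter("a"):
    # "".join(itertext()) approximates the XPath string value of the element (".").
    value = "".join(a.itertext())
    if is_business(value):
        selected.append(value)

print(selected)
```

Note that the check looks at the whole text of each element, so a provider name appearing before the @ (for example gmail.fan@yourbusiness.com) would also cause that address to be excluded.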

XPath for the text of all elements that do not contain specific values
