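XPath: combining multiple queries

Time: 2014-03-27 13:54:54

Tags: php xpath web-scraping

I am scraping job postings from other websites. The source site's markup varies from page to page, because users copy and paste data and the structure changes.

Case 1: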

<h3>Job Description</h3>
<div style="text-align: justify; line-height: 115%"><b>
Receptionist is assigned for ANAFAE-ALC based in Mazar-e-Sharif. This position is supervised by and reports to ALC Educational Program Manager and following are the main duties but are not limited to that.</div>

Case 2:

<h3>Job Description</h3>
<p>
Receptionist is assigned for ANAFAE-ALC based in Mazar-e-Sharif. This position is supervised by and reports to ALC Educational Program Manager and following are the main duties but are not limited to that.</p>

In this case, the p tag is sometimes replaced by another HTML tag.

Case 3:

<h3>Job Description</h3>
Receptionist is assigned for ANAFAE-ALC based in Mazar-e-Sharif. This position is supervised by and reports to ALC Educational Program Manager and following are the main duties but are not limited to that.

I am using the expression below to get the content. It currently works for case 3 but not for the other two. How can I handle all three cases?

//text()[preceding::h3[text()="Job Description"]]

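1 answer:

Answer 0: (score: 0)

Your XPath expression selects text nodes that are preceded by an <h3> whose text node equals "Job Description". That only matches the third case, because in the first two cases the <h3> is followed by a <div> and a <p>, respectively.

You could try something like this: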

//node()[preceding-sibling::*[1][self::h3 = "Job Description"]]/string()

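Some details:

//node() selects all element or text node descendants of the initial context.

preceding-sibling::*[1] selects the nearest preceding sibling element.

[self::h3 = "Job Description"] checks that this element is an <h3> and that its string value equals "Job Description".

/string() returns the string value of the context node. For your example content you could use /descendant-or-self::text() instead. It works by selecting the context node (if it is a text node) and all descendant text nodes (if it is an element). However, if the <div> or <p> changes to contain mixed content (i.e. child elements interspersed with text nodes), that expression returns a sequence of descendant text nodes, whereas /string() concatenates them.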
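
Since the question is tagged php, here is a minimal sketch of how this selection might be applied with PHP's DOM extension. Note that DOMXPath implements XPath 1.0, where a function call such as string() cannot be used as a path step, so the sketch drops the trailing /string() and reads each matched node's textContent instead (that property is the node's string value, i.e. all descendant text concatenated). The helper name jobDescription, the shortened sample text, and the wrapping <div> container are assumptions made for illustration; they are not part of the original question.

```php
<?php
// Hypothetical helper: returns the text that follows the "Job Description" heading.
function jobDescription(string $html): string
{
    $doc = new DOMDocument();
    // Scraped markup is often malformed (e.g. the unclosed <b> in case 1),
    // so collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath 1.0: every node whose nearest preceding sibling element is the heading.
    $nodes = $xpath->query('//node()[preceding-sibling::*[1][self::h3 = "Job Description"]]');

    $text = '';
    foreach ($nodes as $node) {
        // For a text node this is its text; for an element it is all descendant text joined.
        $text .= $node->textContent;
    }
    // Whitespace-only text nodes between the tags may also match, so trim the result.
    return trim($text);
}

// Shortened sample text; each case mirrors one of the three structures in the question.
$body = 'Receptionist is assigned for ANAFAE-ALC based in Mazar-e-Sharif.';
$heading = '<h3>Job Description</h3>' . "\n";
$cases = [
    'case 1 (div)'       => '<div>' . $heading . '<div style="text-align: justify"><b>' . "\n" . $body . '</div></div>',
    'case 2 (p)'         => '<div>' . $heading . '<p>' . "\n" . $body . '</p></div>',
    'case 3 (bare text)' => '<div>' . $heading . $body . "\n" . '</div>',
];

foreach ($cases as $name => $html) {
    echo $name, ': ', jobDescription($html), PHP_EOL;
}
```

Concatenating textContent in PHP plays the role that /string() plays in the expression above, so mixed content inside the <div> or <p> would still come back as a single string.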