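Extract text from specific nested nodes using attributes

Time: 2013-03-15 05:55:30

Tags: python xpath

I am trying to write an XPath that selects the <h3>, <ul> and <p> tags under div[@class="content"], but only those <p> elements that match p[position() > 1 and position() < last() - 1].

So far I have this...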

//div[@class="content"]/*[self::h3 or self::ul or self::p[position() > 1 and position() < last() - 1]]//text()

But it doesn't work.

Here is the HTML: https://gist.github.com/umrashrf/5167711

1 answer:

Answer 0: (score: 0)

OK, your XML is not well-formed, so I fixed that first.

<?xml version="1.0" encoding="UTF-8"?>
<div class="content">
<h1/>
<h2>
    <p>Certified Nursing Assistant - Full Time</p>
Job Summary</h2>
<p>Responsible for providing personal care and assistance for residents in long    
term care facility.</p>
<h2>
</h2>
<h3>Essential Functions:</h3>
<ul>
    <li>
        <span style="line-height: 1.5;">Responsible</span> for providing   
personal care and assistance to residents </li>
    <li>Assist residents in and out of bed, dressing, feeding, grooming and 
personal hygiene. </li>
    <li>Provide basic treatments as required and directed by nursing staff.  
</li>
    <li>Responsible for observing and reporting changes in residents' physical 
and emotional conditions to charge nurse. </li>
</ul>
<h3>Qualifications: </h3>
<p>Education:</p>
<ul>
    <li>High school diploma or equivalent </li>
    <li>Successful completion of state approved certified nursing assistance 
course </li>
</ul>
<p>Experience:</p>
<ul>
    <li>Previous health care related experience preferred </li>
</ul>
<a id="ctl00_ctl01_namelink" class="btn" href="employment-application.aspx?
positionid=34">Apply Online</a>
<br/>
<br/>
<h2>
Apply in Person</h2>
<p>
To apply in persion please stop by Shenandoah Medical Center to pick up a job 
application.</p>
<h2>
Apply by Mail</h2>
<p>
To apply by mail, download and print <a target="_blank" href="/filesimages/Careers/SMC 
Employment Application.pdf">
    this form</a>. Please fill out the application and then mail to:<br/>
    <br/>
    <strong>Shenandoah Medical Center, Human Resources<br/>
    </strong>300 Pershing Avenue<br/>
Shenandoah, IA 51601</p>
</div>
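
As a side note, one way to check whether markup is well-formed before running XPath against it is to hand it to a strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the two fragments are hypothetical illustrations, not excerpts from the gist, and the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if `markup` parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments: an unclosed <br> is fine in HTML but not in XML.
broken = '<div class="content"><p>First line<br>Second line</p></div>'
fixed = '<div class="content"><p>First line<br/>Second line</p></div>'

print(is_well_formed(broken))
print(is_well_formed(fixed))
```

A strict parser only reports that something is wrong. It does not repair the markup, so the fix still has to be done by hand or with an HTML-tolerant parser.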

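Now, if I understand your question correctly, you are looking for all h3, ul and p tags that are children of div[@class="content"], where each selected child must satisfy the condition [position() > 1 and position() < last() - 1]. For that, I think this single XPath will do: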

//div[@class="content"]/h3[position() > 1 and position() < last() - 1]  |        
//div[@class="content"]/p[position() > 1 and position() < last() - 1]  |  
//div[@class="content"]/ul[position() > 1 and position() < last() - 1]
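
To see what each branch of that union picks up, here is a rough sketch in Python. The standard library's xml.etree.ElementTree does not support comparisons on position() in its XPath subset, so the predicate is emulated in plain Python. The SAMPLE document is a hypothetical, trimmed-down stand-in for the cleaned-up XML above, and the helper names (content_divs, select_middle, select_union) are made up for illustration. The key point is that in div/p[...] the position is counted among the p children only, and likewise for h3 and ul.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the cleaned-up document above.
SAMPLE = """
<div class="content">
  <h2>Job Summary</h2>
  <p>Summary paragraph</p>
  <h3>Essential Functions:</h3>
  <ul><li>Provide personal care</li></ul>
  <h3>Qualifications:</h3>
  <p>Education:</p>
  <ul><li>High school diploma</li></ul>
  <p>Experience:</p>
  <ul><li>Health care experience</li></ul>
  <p>Apply in person</p>
  <p>Apply by mail</p>
</div>
"""


def content_divs(root):
    """Emulate //div[@class="content"], including the root element itself."""
    return [d for d in root.iter("div") if d.get("class") == "content"]


def select_middle(parent, tag):
    """Emulate parent/tag[position() > 1 and position() < last() - 1]."""
    same_name = parent.findall(tag)  # direct children named `tag`, in document order
    last = len(same_name)
    return [
        node
        for position, node in enumerate(same_name, start=1)
        if 1 < position < last - 1
    ]


def select_union(parent):
    """Emulate the union of the h3, p and ul branches, in document order."""
    picked = []
    for tag in ("h3", "p", "ul"):
        picked.extend(select_middle(parent, tag))
    picked_ids = {id(node) for node in picked}
    return [child for child in parent if id(child) in picked_ids]


root = ET.fromstring(SAMPLE)
selected = [node for div in content_divs(root) for node in select_union(div)]
texts = [(node.tag, "".join(node.itertext()).strip()) for node in selected]
print(texts)  # [('p', 'Education:'), ('p', 'Experience:')]
```

On this sample only the two middle p elements come back. The h3 and ul branches contribute nothing, because with two h3 children and three ul children no position is both greater than 1 and less than last() - 1. The cleaned-up document above has the same counts. If the position restriction is meant for p only, dropping the predicate from the h3 and ul branches would be one option to try.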