Question

I want to extract the description from an HTML page.

My div contains the following data:

  <div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>
 <p>
    <strong>Responsibilities</strong>
  </p>
  <ul>
     <li> Ownership and oversight of full-cycle accounts payable responsibilities including but not limited to, invoice processing, maintaining vendor records, running payment reports according to payment schedules, reconciling vendor statements)</li>
     <li> Identify and implement process improvements and automation in appropriate areas throughout the AP cycle</li>
     <li> Provide excellent customer service to vendors and employees by researching and resolving inquiries in a timely manner</li>
     <li> Assist with month-end activities, accruals, reconciliation, preparing 1099s, and audit support</li>
   <li> Assist with ad-hoc requests</li>
  </ul>
 <p>
    <strong>Qualifications</strong>
 </p>
  <ul>
     <li> AA/AS degree or equivalent experience in accounting</li>
     <li> Three years or more of related experience</li>
     <li> Full cycle accounts payable knowledge</li>
  </ul>
  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

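Here I only need the description content. I don't want the data that contains Responsibilities and Qualifications:

<p>Responsibilities</p><ul> ... </ul>
<p>Qualifications</p><ul> .. </ul>

These are not needed and should be excluded by the XPath.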

I am using the following code:

sel.xpath(
        'description',
        '//div[@class="container page_op-detail"][not(descendant-or-self::p/strong[contains(text(), "Qualifications")]/../ul[1])]'
    ).extract()

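This doesn't work. Please help me write an XPath that excludes these elements. How should an XPath be written for this kind of query?

Expected output: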

<div class="container page_op-detail">
 <form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded" action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21">
 <span id="ajax-view-state-page-container" style="display: none">
 <p> Solving the world’s hardest problems is no easy task. Our engineers often find themselves in the midst of combat zones, disaster relief efforts or even worse, boardrooms. AP Specialists ensure that our engineers have every tool they need to crack some of the most challenging and puzzling problems on the planet. We do this by managing numerous relationships with vendors across the country and around the globe that can provide our engineers with everything they need to make the world a safer place. As our company continues to grow, we are constantly thinking about how to improve and automate processes so we can continue providing amazing outcomes in even more places across the world.</p>

  <p class="type-centered">
       Data is more organised...!!!
   </p>
  <p class="type-centered apply-button">
  </div>

Answer 1

Assuming the form and span tags are empty elements, you can try this XPath:

//div[@class='container page_op-detail']/*[not(self::p[normalize-space(.)='Responsibilities'])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']])
                                        and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']])
                                        and not(self::p[normalize-space(.)='Qualifications'])]
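
To try this kind of expression outside a spider, here is a minimal sketch that assumes lxml is installed and runs the filter against a shortened, well-formed copy of the page. The names PAGE, KEEP_XPATH and kept_children are made up for illustration, and the expression starts with // so the div is found wherever it sits in the document.

```python
from lxml import html  # assumption: lxml is available

# Shortened, well-formed stand-in for the page in the question (hypothetical data).
PAGE = """
<html><body>
<div class="container page_op-detail">
  <form id="f"></form>
  <span id="ajax-view-state-page-container" style="display: none"></span>
  <p> Solving the world's hardest problems ... </p>
  <p>
    <strong>Responsibilities</strong>
  </p>
  <ul><li> Assist with ad-hoc requests</li></ul>
  <p>
    <strong>Qualifications</strong>
  </p>
  <ul><li> Full cycle accounts payable knowledge</li></ul>
  <p class="type-centered">
     Data is more organised...!!!
  </p>
  <p class="type-centered apply-button"></p>
</div>
</body></html>
"""

# Keep every direct child of the div except the two heading paragraphs
# and any <ul> that has one of those headings as a preceding sibling.
KEEP_XPATH = (
    "//div[@class='container page_op-detail']/*["
    "not(self::p[normalize-space(.)='Responsibilities']) "
    "and not(self::ul[preceding-sibling::p[normalize-space(.)='Responsibilities']]) "
    "and not(self::ul[preceding-sibling::p[normalize-space(.)='Qualifications']]) "
    "and not(self::p[normalize-space(.)='Qualifications'])]"
)


def kept_children(page):
    """Return the child elements of the div that survive the filter."""
    tree = html.fromstring(page)
    return tree.xpath(KEEP_XPATH)


kept = kept_children(PAGE)
kept_tags = [el.tag for el in kept]
kept_text = " ".join(el.text_content().strip() for el in kept)
print(kept_tags)
print(kept_text)
```

Note that the expression returns the surviving child elements rather than a rebuilt div, so the caller still has to join or serialize them.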

Answer 2

First, your HTML is missing several closing tags, such as </form>. I assume the following HTML is the corrected version:

<div class="container page_op-detail">
<form id="j_id0:OpenPositionTemplate:j_id21" enctype="application/x-www-form-urlencoded"         action="/careers/OpenPosDetail?id=a0m80000002zvKeAAI" method="post" name="j_id0:OpenPositionTemplate:j_id21"></form>
<span id="ajax-view-state-page-container" style="display: none"></span>
<p> Solving the world’s hardest problems ... </p>
<p>
<strong>Responsibilities</strong>
</p>
<ul>
 <li> Ownership and oversight of full-cycle .....</li>
 <li> Identify and implement process improvements ...</li>
 <li> Provide excellent customer service to vendors ... </li>
 <li> Assist with month-end activities, accruals, ...</li>
<li> Assist with ad-hoc requests</li>
</ul>
<p>
<strong>Qualifications</strong>
</p>
<ul>
 <li> AA/AS degree or equivalent experience in accounting</li>
 <li> Three years or more of related experience</li>
 <li> Full cycle accounts payable knowledge</li>
</ul>
<p class="type-centered">
   Data is more organised...!!!
</p>
<p class="type-centered apply-button"></p>
</div>

The first tag can be extracted with:

//div[@class="container page_op-detail"]/p[1]/text()

The next tag you need can be extracted with:

//div[@class="container page_op-detail"]/p[@class="type-centered"]/text()
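
Both expressions can be checked outside Scrapy. The following is a rough sketch that assumes lxml is installed and uses a shortened, well-formed copy of the page; PAGE, FIRST_P, CENTERED_P and extract_description are hypothetical names, not part of any library.

```python
from lxml import html  # assumption: lxml is available

# Shortened, well-formed stand-in for the corrected HTML (hypothetical data).
PAGE = """
<html><body>
<div class="container page_op-detail">
  <form id="f"></form>
  <span id="ajax-view-state-page-container" style="display: none"></span>
  <p> Solving the world's hardest problems ... </p>
  <p><strong>Responsibilities</strong></p>
  <ul><li> Assist with ad-hoc requests</li></ul>
  <p><strong>Qualifications</strong></p>
  <ul><li> Full cycle accounts payable knowledge</li></ul>
  <p class="type-centered">
     Data is more organised...!!!
  </p>
  <p class="type-centered apply-button"></p>
</div>
</body></html>
"""

FIRST_P = '//div[@class="container page_op-detail"]/p[1]/text()'
# @class="type-centered" is an exact match, so the apply-button paragraph
# (class "type-centered apply-button") is not selected.
CENTERED_P = '//div[@class="container page_op-detail"]/p[@class="type-centered"]/text()'


def extract_description(page):
    """Run both XPaths and join the non-empty text nodes into one string."""
    tree = html.fromstring(page)
    parts = tree.xpath(FIRST_P) + tree.xpath(CENTERED_P)
    return " ".join(part.strip() for part in parts if part.strip())


description = extract_description(PAGE)
print(description)
```

Because p[1] selects by position, this approach depends on the introductory paragraph always being the first p child of the div.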

You can then use an item loader to add both extractions to the same item field, 'description', as in the Scrapy item loader example or as shown here:

from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    loader = ItemLoader(item=Product(), response=response)
    # Both calls target the same field, so their results are collected together.
    loader.add_xpath('name', '//div[@class="product_name"]')
    loader.add_xpath('name', '//div[@class="product_title"]')  # note: the 'name' field is used twice
    return loader.load_item()

Writing an XPath to select the description
