How to scrape nested data using Selenium and Python

Time: 2017-03-27 14:18:40

Tags: python selenium web screen-scraping

I basically want to scrape "Feb 2016 – Present", which sits under a <span class="visually-hidden">, but I can't get to it. Here is the HTML from the page:

<div class="pv-entity__summary-info">

<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>

<h4>
  <span class="visually-hidden">Company Name</span>
  <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>


  <div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
      <span class="visually-hidden">Dates Employed</span>
      <span>Feb 2016 – Present</span>
    </h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item">1 yr 2 mos</span>
      </h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
      <span class="visually-hidden">Location</span>
      <span class="pv-entity__bullet-item">London, United Kingdom</span>
    </h4></div>

</div>

Here is what I have so far in my code using Selenium:

        date= browser.find_element_by_xpath('.//div[@class = "pv-entity__duration de Sans-15px-black-55% ml0"]').text
        print date

But this gives no result. How do I pull the date?

2 Answers:

Answer 0: (score: 2)

There is no div with class="pv-entity__duration de Sans-15px-black-55% ml0"; that class belongs to an h4. If you want the text of the enclosing div, try:

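from selenium.webdriver.common.by import By  # provides By.XPATH
# Matches the div that wraps the three h4 elements
date = browser.find_element(By.XPATH, './/div[@class="pv-entity__position-info detail-facet m0"]').text
print(date)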

If you want to get just "Feb 2016 – Present", try:

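from selenium.webdriver.common.by import By  # provides By.XPATH
# span[2] is the second span inside the date-range h4, which holds the visible dates
date = browser.find_element(By.XPATH, '//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span[2]').text
print(date)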
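
The point about div versus h4 can be checked offline, without a browser. The following is only a minimal sketch: it uses Python's standard-library xml.etree.ElementTree instead of Selenium (an assumption made for illustration), which supports only a subset of XPath and works here only because the snippet happens to be well-formed XML. The names SNIPPET, DURATION_CLASS and DATE_CLASS are hypothetical, and the markup is a trimmed copy of the HTML from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question (hypothetical constant name).
SNIPPET = """
<div class="pv-entity__summary-info">
  <div class="pv-entity__position-info detail-facet m0">
    <h4 class="pv-entity__date-range Sans-15px-black-55%">
      <span class="visually-hidden">Dates Employed</span>
      <span>Feb 2016 \u2013 Present</span>
    </h4>
    <h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
      <span class="visually-hidden">Employment Duration</span>
      <span class="pv-entity__bullet-item">1 yr 2 mos</span>
    </h4>
  </div>
</div>
"""

DURATION_CLASS = "pv-entity__duration de Sans-15px-black-55% ml0"
DATE_CLASS = "pv-entity__date-range Sans-15px-black-55%"

root = ET.fromstring(SNIPPET.strip())

# The XPath from the question asks for a <div> with the duration class: no match.
as_div = root.findall(f'.//div[@class="{DURATION_CLASS}"]')
# The same class on an <h4> matches exactly one element.
as_h4 = root.findall(f'.//h4[@class="{DURATION_CLASS}"]')
# span[2] selects the second <span> child, which holds the visible date range.
date_span = root.find(f'.//h4[@class="{DATE_CLASS}"]/span[2]')

print(len(as_div), len(as_h4))
print(date_span.text)
```

With this snippet the first line printed is "0 1" and the second is "Feb 2016 – Present". A page rendered in a real browser may not parse as XML at all, so treat this as a way to reason about the locators rather than a replacement for Selenium.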

Answer 1: (score: 0)

You can rewrite the XPath code like this:

# -*- coding: utf-8 -*-
from lxml import html


html_str = """
<div class="pv-entity__summary-info">

<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>

<h4>
  <span class="visually-hidden">Company Name</span>
  <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>


  <div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
      <span class="visually-hidden">Dates Employed</span>
      <span>Feb 2016 – Present</span>
    </h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item">1 yr 2 mos</span>
      </h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
      <span class="visually-hidden">Location</span>
      <span class="pv-entity__bullet-item">London, United Kingdom</span>
    </h4></div>

</div>
"""

root = html.fromstring(html_str)
# Text of the second span under the date-range h4: "Feb 2016 – Present"
txt = root.xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span/text()')[1]
# Text of the second span under the duration h4: "1 yr 2 mos"
txt1 = root.xpath('//h4[@class="pv-entity__duration de Sans-15px-black-55% ml0"]/span/text()')[1]
print(txt)
print(txt1)

This will result in (on Python 3, where html_str is a Unicode string):

Feb 2016 – Present
1 yr 2 mos
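
Each h4 in the snippet pairs a visually-hidden label span with a visible value span, so the same idea extends to collecting every field at once. The following is a minimal sketch under stated assumptions: it uses the standard-library xml.etree.ElementTree instead of lxml (possible only because the snippet is well-formed XML), it assumes the label-then-value span order shown in the question, and the names SNIPPET and extract_position_info are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question (hypothetical constant name).
SNIPPET = """
<div class="pv-entity__summary-info">
  <h4>
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
  </h4>
  <div class="pv-entity__position-info detail-facet m0">
    <h4 class="pv-entity__date-range Sans-15px-black-55%">
      <span class="visually-hidden">Dates Employed</span>
      <span>Feb 2016 \u2013 Present</span>
    </h4>
    <h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
      <span class="visually-hidden">Employment Duration</span>
      <span class="pv-entity__bullet-item">1 yr 2 mos</span>
    </h4>
    <h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
      <span class="visually-hidden">Location</span>
      <span class="pv-entity__bullet-item">London, United Kingdom</span>
    </h4>
  </div>
</div>
"""


def extract_position_info(markup):
    """Map each hidden label (first span) to the visible value (second span)."""
    root = ET.fromstring(markup.strip())
    info = {}
    for h4 in root.iter("h4"):
        spans = h4.findall("span")
        if len(spans) >= 2:  # skip any h4 that lacks the label/value pair
            label = (spans[0].text or "").strip()
            value = (spans[1].text or "").strip()
            info[label] = value
    return info


info = extract_position_info(SNIPPET)
for label, value in info.items():
    print(f"{label}: {value}")
```

Keying on the hidden labels avoids depending on the long styling class strings, which would break an exact @class comparison if the site changed them. Whether the live page keeps this label/value layout is an assumption that would need checking against the real markup.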