Scrapy extracts empty td values even though the table has no empty cells

Time: 2016-05-08 14:19:31

Tags: python html web-scraping scrapy

I wrote a spider to scrape this site: www.docteur.ch/generalistes/generalistes_k_ag.html

It scrapes the td cells of tables in the following format:

<table class="novip">
        <tr class="novip">
          <td class="novip-portrait-picture"
            rowspan="5">
            <a class="novip-portrait-picture"
              href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html">
              <img class="novip-portrait-picture"
                src="/customer_controlled/pictures/65903/portrait/65903.png"
                alt="Pas d'image encore"
                onError="portrait_m_image_failover(this)" />
            </a>
          </td>
          <td class="novip-left">
            <a class="novip-firmen-name"
              href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
              target="_top">
              Baumberger&nbsp;Hans Rudolf
            </a>
          </td>
          <td class="novip-right"
            width="25%">
            <a class="novip"
              href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
              target="_top">
              rating info:&nbsp;              <img class="novip-inforating"
                src="/img/general/stars/stars3 "
                alt="rating info"
                width="70" height="14" align="bottom" border="0" />
            </a>
          </td>
        </tr>
        <tr class="novip">
          <td class="novip-left">
            Dr. med. Facharzt FMH f&uuml;r Allgemeine Innere Medizin
          </td>
        </tr>
        <tr class="novip">
          <td class="novip-left">
            Bahnhofstrasse&nbsp;92, 5000&nbsp;Aarau
          </td>
          <td class="novip-right-telefon">
            t&eacute;l:&nbsp;062 822 46 28
          </td>
        </tr>
        <tr class="novip">
          <td class="novip-left-email">
            e-mail:&nbsp;
            <a class="novip-left-send-message-button-inactive"
              href="/eintrag/fr_keine_mitteilung_moeglich.html">
              Envoyer un message
            </a>
              &nbsp;
            <a class="novip-left-make_appointment-button-inactive"
              href="/eintrag/fr_kein_termin_moeglich.html">
              prendre un rendez-vous
            </a>
          </td>
          <td class="novip-right-fax">
            fax:&nbsp;062 822 35 20
          </td>
        </tr>
      </table>

I only want to extract the doctor's name, using the following code:

import scrapy

from docteur.items import DocteurItem


class DocteurGeneralistSpider(scrapy.Spider):
    name = "docteur_generalist"
    allowed_domains = ["docteur.ch"]
    start_urls = [
    'http://www.docteur.ch/generalistes/generalistes_k_ag.html',
    ]


    def parse(self, response):
        for sel in response.xpath('//table/tr[@class="novip"]'):
            item = DocteurItem()
            item['name'] = sel.xpath('.//td[2]/a[@class="novip-firmen-name"]/text()[normalize-space()]').extract_first(default='not-found')
            #item['phone'] = sel.xpath('.//td[@class="novip-right-telefon"]/text()[normalize-space()]').extract_first()
            yield item

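I do get the names, but each entry is also followed by empty fields (the not-found defaults below), even though there are no empty td elements in the page source: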

    [{"name": "\n              Baumberger\u00a0Hans Rudolf\n            "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
{"name": "\n              Bettschart\u00a0Robert\n            "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
....]

What is wrong with my code? How can I extract only the cells that have a value?

2 answers:

Answer 0 (score: 1)

This gets all the names:

 names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]//text()').extract()

It returns just the 467 names:

In [14]: names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')

In [15]: len(names)
Out[15]: 467

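Your loop visits every tr, and most of them contain no anchor with class="novip-firmen-name". For those rows the query comes back empty, so you get your default value.

If we take the first few rows, you can see what is happening: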

In [23]: for sel in response.xpath('//table/tr[@class="novip"]')[:5]:
             print(sel.xpath('.//td[2]/a[@class="novip-firmen-name"]'))
   ....:     
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
[]
[]
[]
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
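
The same effect can be reproduced offline. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath, but enough for this case). SAMPLE is a trimmed, well-formed version of the table from the question, and name_per_row and names_only are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of one table from the question (entities removed).
SAMPLE = """
<table class="novip">
  <tr class="novip">
    <td class="novip-portrait-picture" rowspan="5"><a href="#">img</a></td>
    <td class="novip-left"><a class="novip-firmen-name" href="#">Baumberger Hans Rudolf</a></td>
    <td class="novip-right"><a class="novip" href="#">rating info</a></td>
  </tr>
  <tr class="novip"><td class="novip-left">Dr. med. Facharzt FMH</td></tr>
  <tr class="novip">
    <td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td>
    <td class="novip-right-telefon">tel: 062 822 46 28</td>
  </tr>
  <tr class="novip">
    <td class="novip-left-email">e-mail:</td>
    <td class="novip-right-fax">fax: 062 822 35 20</td>
  </tr>
</table>
"""


def name_per_row(table, default="not-found"):
    """Mimic the question's loop: one result per tr, default on a miss."""
    results = []
    for row in table.findall("tr[@class='novip']"):
        anchor = row.find("td[2]/a[@class='novip-firmen-name']")
        results.append(anchor.text.strip() if anchor is not None else default)
    return results


def names_only(table):
    """Select the name anchors directly, so rows without one are never visited."""
    return [a.text.strip() for a in table.findall(".//a[@class='novip-firmen-name']")]


table = ET.fromstring(SAMPLE)
print(name_per_row(table))  # one name followed by three defaults
print(names_only(table))    # just the name
```

Only the first of the four rows holds the name anchor, which matches the pattern of one name followed by three not-found entries in the question's output.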

If you search only for the anchor tags with class="novip-firmen-name", you get what you want:

In [38]: for sel in response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')[:5]:
             print(sel.xpath('.//text()').extract_first().strip())
   ....:     
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel

Or you can select only the tds that contain an anchor tag with that class:

In [39]: for sel in response.xpath('//table/tr[@class="novip"]/td[a[@class="novip-firmen-name"]]')[:5]:
             print(sel.xpath('./a/text()').extract_first().strip())
   ....:     
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel
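
If the phone number is wanted as well (the commented-out line in the question), one option is to loop over tables rather than rows, so the name and the phone of one doctor end up in the same item. Below is a rough sketch of that idea, again with xml.etree.ElementTree standing in for Scrapy's selectors. It assumes each table describes exactly one doctor, as the sample markup suggests. PAGE is a made-up, trimmed fragment and iter_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed fragment: two tables, the second one without a phone cell.
PAGE = """
<div>
  <table class="novip">
    <tr class="novip">
      <td class="novip-left"><a class="novip-firmen-name" href="#">Baumberger Hans Rudolf</a></td>
    </tr>
    <tr class="novip">
      <td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td>
      <td class="novip-right-telefon">tél: 062 822 46 28</td>
    </tr>
  </table>
  <table class="novip">
    <tr class="novip">
      <td class="novip-left"><a class="novip-firmen-name" href="#">Bettschart Robert</a></td>
    </tr>
  </table>
</div>
"""


def iter_items(root):
    """Yield one dict per table, skipping tables that have no name anchor."""
    for table in root.iter("table"):
        anchor = table.find(".//a[@class='novip-firmen-name']")
        if anchor is None or not (anchor.text or "").strip():
            continue  # nothing to report for this table
        phone_cell = table.find(".//td[@class='novip-right-telefon']")
        has_phone = phone_cell is not None and phone_cell.text
        yield {
            "name": anchor.text.strip(),
            "phone": phone_cell.text.strip() if has_phone else None,
        }


items = list(iter_items(ET.fromstring(PAGE)))
print(items)
```

In a real spider the same shape would apply to the parse method: iterate over the table selectors, query the name and phone relative to each table, and yield only when a name was found.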

Answer 1 (score: 0)

An alternative to @Padraic's answer, using CSS selectors:

  1. For each table, select the first child row with class "novip"
  2. In each row, select the anchor with class "novip-firmen-name"
  3. In scrapy shell:

    >>> from pprint import pprint
    >>> for row in response.css('table.novip > tr.novip:first-child'):
    ...     print("----------")
    ...     s = row.css('a.novip-firmen-name').xpath('normalize-space()').extract_first()
    ...     pprint(s)
    ... 
    ----------
    u'Baumberger\xa0Hans Rudolf'
    ----------
    u'Bettschart\xa0Robert'
    ----------
    u'Bock\xa0Andreas'
    ----------
    u'Br\xe4ndli\xa0Heinrich'
    ----------
    u'Buchser\xa0Marcel'
    ----------
    u'B\xfchlmann\xa0Severin'
    ----------
    u'Dang\xa0Linh'
    (...)
    ----------
    u'Vonesch\xa0Hans-J\xfcrg'
    ----------
    u'Koppe\xa0Dagmar'
    >>>
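
The names printed above still contain \xa0 (a non-breaking space, from the &nbsp; in the page source), because XPath's normalize-space() only treats space, tab, carriage return and line feed as whitespace. If plain spaces are preferred, a small post-processing step is one way to handle it. This is a sketch assuming Python 3, where str.split() without arguments splits on any Unicode whitespace, including U+00A0. clean_name is a hypothetical helper, not part of Scrapy.

```python
def clean_name(raw):
    """Collapse every run of whitespace, including U+00A0, into one space."""
    return " ".join(raw.split())


# Raw value as it appears in the question's output.
print(clean_name("\n              Baumberger\u00a0Hans Rudolf\n            "))
# Value as returned by normalize-space() in the answer above.
print(clean_name("Br\xe4ndli\xa0Heinrich"))
```

Because the helper also trims leading and trailing whitespace, it works on the raw text() value as well as on the normalize-space() result.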