I wrote a spider to scrape this website: www.docteur.ch/generalistes/generalistes_k_ag.html
It scrapes the td cells of tables that have the following format:
<table class="novip">
<tr class="novip">
<td class="novip-portrait-picture"
rowspan="5">
<a class="novip-portrait-picture"
href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html">
<img class="novip-portrait-picture"
src="/customer_controlled/pictures/65903/portrait/65903.png"
alt="Pas d'image encore"
onError="portrait_m_image_failover(this)" />
</a>
</td>
<td class="novip-left">
<a class="novip-firmen-name"
href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
target="_top">
Baumberger Hans Rudolf
</a>
</td>
<td class="novip-right"
width="25%">
<a class="novip"
href="/medecin/baumberger-hans-rudolf-aarau-5000-medecin.html"
target="_top">
rating info: <img class="novip-inforating"
src="/img/general/stars/stars3 "
alt="rating info"
width="70" height="14" align="bottom" border="0" />
</a>
</td>
</tr>
<tr class="novip">
<td class="novip-left">
Dr. med. Facharzt FMH für Allgemeine Innere Medizin
</td>
</tr>
<tr class="novip">
<td class="novip-left">
Bahnhofstrasse 92, 5000 Aarau
</td>
<td class="novip-right-telefon">
tél: 062 822 46 28
</td>
</tr>
<tr class="novip">
<td class="novip-left-email">
e-mail:
<a class="novip-left-send-message-button-inactive"
href="/eintrag/fr_keine_mitteilung_moeglich.html">
Envoyer un message
</a>
<a class="novip-left-make_appointment-button-inactive"
href="/eintrag/fr_kein_termin_moeglich.html">
prendre un rendez-vous
</a>
</td>
<td class="novip-right-fax">
fax: 062 822 35 20
</td>
</tr>
</table>
I only want to extract the doctor's name, using the following code:
import scrapy

from docteur.items import DocteurItem


class DocteurGeneralistSpider(scrapy.Spider):
    name = "docteur_generalist"
    allowed_domains = ["docteur.ch"]
    start_urls = [
        'http://www.docteur.ch/generalistes/generalistes_k_ag.html',
    ]

    def parse(self, response):
        for sel in response.xpath('//table/tr[@class="novip"]'):
            item = DocteurItem()
            item['name'] = sel.xpath('.//td[2]/a[@class="novip-firmen-name"]/text()[normalize-space()]').extract_first(default='not-found')
            #item['phone'] = sel.xpath('.//td[@class="novip-right-telefon"]/text()[normalize-space()]').extract_first()
            yield item
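I get the names, but each one is also followed by empty results (the "not-found" default), even though the page source contains no empty td cells: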
[{"name": "\n Baumberger\u00a0Hans Rudolf\n "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
{"name": "\n Bettschart\u00a0Robert\n "},
{"name": "not-found"},
{"name": "not-found"},
{"name": "not-found"},
....]
What is wrong with my code? How can I extract only the cells that have a value?
Answer 0 (score: 1)
This will get all the names:
names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]//text()').extract()
It returns just the 467 names:
In [14]: names = response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')
In [15]: len(names)
Out[15]: 467
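Your loop visits every tr, and most of them have no matching anchor. Whenever a row does not contain the a with class="novip-firmen-name", the XPath returns nothing and you get your default value.
Taking the first few rows shows what is happening: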
In [23]: for sel in response.xpath('//table/tr[@class="novip"]')[:5]:
   ....:     print(sel.xpath('.//td[2]/a[@class="novip-firmen-name"]'))
   ....:
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
[]
[]
[]
[<Selector xpath='.//td[2]/a[@class="novip-firmen-name"]' data=u'<a class="novip-firmen-name" href="/mede'>]
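The same effect can be reproduced offline. The following is a minimal sketch, not the Scrapy code itself: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (it supports only a subset of XPath), and MARKUP is a hypothetical, trimmed-down version of one table from the question. The helper name names_per_row is also an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced copy of one doctor's table (made well-formed for ElementTree).
MARKUP = """<table class="novip">
  <tr class="novip">
    <td class="novip-portrait-picture" rowspan="5"><a class="novip-portrait-picture" href="#">portrait</a></td>
    <td class="novip-left"><a class="novip-firmen-name" href="#">
      Baumberger Hans Rudolf
    </a></td>
  </tr>
  <tr class="novip"><td class="novip-left">Dr. med. Facharzt FMH</td></tr>
  <tr class="novip"><td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td></tr>
  <tr class="novip"><td class="novip-left-email">e-mail:</td></tr>
</table>"""


def names_per_row(markup, default="not-found"):
    """Mimic the question's loop: one result per tr, default when no name anchor exists."""
    root = ET.fromstring(markup)
    results = []
    for row in root.findall(".//tr[@class='novip']"):
        anchor = row.find(".//a[@class='novip-firmen-name']")
        results.append(anchor.text.strip() if anchor is not None else default)
    return results


print(names_per_row(MARKUP))
# ['Baumberger Hans Rudolf', 'not-found', 'not-found', 'not-found']
```

Only the first of the four rows holds the name anchor, so the other three rows fall back to the default, which matches the pattern in the question's output.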
If you select only the anchor tags with class="novip-firmen-name", you get exactly what you want:
In [38]: for sel in response.xpath('//table/tr[@class="novip"]//a[@class="novip-firmen-name"]')[:5]:
   ....:     print(sel.xpath('.//text()').extract_first().strip())
   ....:
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel
Alternatively, you can select only the tds that contain an anchor with that class:
In [39]: for sel in response.xpath('//table/tr[@class="novip"]/td[a[@class="novip-firmen-name"]]')[:5]:
   ....:     print(sel.xpath('./a/text()').extract_first().strip())
   ....:
Baumberger Hans Rudolf
Bettschart Robert
Bock Andreas
Brändli Heinrich
Buchser Marcel
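To illustrate why selecting the anchors directly avoids the placeholder entries, here is a small sketch under the same assumptions as before: xml.etree.ElementTree stands in for Scrapy selectors, and both MARKUP and the helper name extract_names are hypothetical. In a real spider the equivalent expression would be passed to response.xpath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced markup: two rows, only one of which carries a name anchor.
MARKUP = """<table class="novip">
  <tr class="novip">
    <td class="novip-left"><a class="novip-firmen-name" href="#">
      Baumberger Hans Rudolf
    </a></td>
    <td class="novip-right"><a class="novip" href="#">rating info</a></td>
  </tr>
  <tr class="novip"><td class="novip-left">Bahnhofstrasse 92, 5000 Aarau</td></tr>
</table>"""


def extract_names(markup):
    """Select the name anchors themselves, so rows without one produce no result at all."""
    root = ET.fromstring(markup)
    anchors = root.findall(".//tr[@class='novip']//a[@class='novip-firmen-name']")
    return [anchor.text.strip() for anchor in anchors]


print(extract_names(MARKUP))
# ['Baumberger Hans Rudolf']
```

Because the loop runs over matching anchors rather than over rows, there is nothing to fall back to a default for.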
Answer 1 (score: 0)
An alternative to @Padraic's answer, using CSS selectors:
In the scrapy shell:
>>> from pprint import pprint
>>> for row in response.css('table.novip > tr.novip:first-child'):
...     print("----------")
...     s = row.css('a.novip-firmen-name').xpath('normalize-space()').extract_first()
...     pprint(s)
...
----------
u'Baumberger\xa0Hans Rudolf'
----------
u'Bettschart\xa0Robert'
----------
u'Bock\xa0Andreas'
----------
u'Br\xe4ndli\xa0Heinrich'
----------
u'Buchser\xa0Marcel'
----------
u'B\xfchlmann\xa0Severin'
----------
u'Dang\xa0Linh'
(...)
----------
u'Vonesch\xa0Hans-J\xfcrg'
----------
u'Koppe\xa0Dagmar'
>>>
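The names printed above still contain a non-breaking space (\xa0) between family name and given name, because XPath's normalize-space() does not treat it as whitespace. If plain spaces are wanted, one possible post-processing step is sketched below. The helper name clean_name is an assumption for illustration, and the sketch assumes Python 3 str values (or unicode objects in Python 2).

```python
def clean_name(raw):
    """Collapse every whitespace run, including non-breaking spaces, into one space."""
    # str.split() with no arguments splits on all Unicode whitespace, \xa0 included.
    return " ".join(raw.split())


print(clean_name("\n Baumberger\u00a0Hans Rudolf\n "))
# Baumberger Hans Rudolf
```

This could be applied to each extracted value before it is stored in the item.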