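I have a scraper that scrapes nested data from a website. By nested I mean that to reach a data page I have to click through 5 links; only then do I land on the page with the data, which I scrape.
For every 1st page there are multiple 2nd pages, for every 2nd page there are many 3rd pages, and so on.
So I have one parse function per level that opens each page until I reach the page containing the data, adds the data to the item class, and returns the item.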
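But it skips a lot of links without scraping any data. After 100 or so links it stops executing the last function, parse_link. How do I know parse_link is not being executed?
Because I print '\n\n', 'I AM EXECUTED !!!!' at the top of it, and after 100 or so links that is no longer printed, even though parse_then is still executed every time.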
What I want to know is: am I doing this right? Is this the right approach for scraping a website like this?
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from urlparse import urljoin
from nothing.items import NothingItem


class Canana411Spider(scrapy.Spider):
    name = "canana411"
    allowed_domains = ["www.canada411.ca"]
    start_urls = ['http://www.canada411.ca/']

    def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_next)

    def parse_next(self, response):
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_more)

    def parse_more(self, response):
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_other)

    def parse_other(self, response):
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'
        for attr in response.css(SET_SELECTOR):
            linkse = 'a ::attr(href)'
            link = attr.css(linkse).extract_first()
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_then)

    def parse_then(self, response):
        SET_SELECTOR = '.c411Cities li h3 a ::attr(href)'
        link = response.css(SET_SELECTOR).extract_first()
        link = urljoin(response.url, link)
        return scrapy.Request(link, callback=self.parse_link)

    def parse_link(self, response):
        print '\n\n', 'I AM EXECUTED !!!!'
        item = NothingItem()
        namese = '.vcard__name ::text'
        addressse = '.c411Address.vcard__address ::text'
        phse = 'span.vcard__label ::text'
        item['name'] = response.css(namese).extract_first()
        item['address'] = response.css(addressse).extract_first()
        item['phone'] = response.css(phse).extract_first()
        return item
Am I doing it right, or is there a better way that I'm missing?
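
One thing that may be worth ruling out, offered as a guess rather than a confirmed diagnosis: `extract_first()` returns `None` when a selector matches nothing, and `urljoin` returns the base URL when it is given an empty link. In `parse_then` that would produce a request for the page that was just fetched, which Scrapy's default duplicate filter would likely drop, so `parse_link` would never run for it. The sketch below uses only the standard library (Python 3's `urllib.parse`, the successor of the `urlparse` module imported above). The helper `absolute_links` and the example paths are hypothetical and not part of Scrapy or of this spider.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin


def absolute_links(base_url, hrefs):
    """Return absolute URLs for the usable hrefs, skipping None and empty ones.

    Hypothetical helper for illustration; not part of Scrapy or the spider.
    """
    links = []
    for href in hrefs:
        if not href:
            # A missing href would otherwise resolve back to base_url itself.
            continue
        links.append(urljoin(base_url, href.strip()))
    return links


# Example page URL; the path is made up for this sketch.
page = "http://www.canada411.ca/some/listing.html"

# An empty link resolves to the page that was just fetched.
print(urljoin(page, ""))

# Only the real link survives the guard.
print(absolute_links(page, [None, "", "/res/example.html"]))
```

If this is what happens here, the stats printed at the end of the crawl should show a non-zero `dupefilter/filtered` count, and checking `link` for `None` before building the request in `parse_then` would show which pages lack the expected `.c411Cities li h3 a` link.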