Question

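I need help getting the following code to navigate to the remaining pages linked from the URL in start_urls and scrape data from them. Please help.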

class texashealthspider(CrawlSpider):

    name="texashealth2"
    allowed_domains=['www.texashealth.org']
    start_urls=['http://jobs.texashealth.org/search/']

    rules=(
        Rule(SgmlLinkExtractor(allow=("startrow=\d",)),callback="parse",follow=True),
        )

    def parse(self, response):
        hxs=HtmlXPathSelector(response)
        titles=hxs.select('//tbody/tr/td')
        items = []

        for titles in titles:
            item=TexashealthItem()
            item['title']=titles.select('span[@class="jobTitle"]/a/text()').extract()
            item['link']=titles.select('span[@class="jobTitle"]/a/@href').extract()
            item['shifttype']=titles.select('span[@class="jobShiftType"]/text()').extract()
            item['location']=titles.select('span[@class="jobLocation"]/text()').extract()
            items.append(item)
        print items
        return items

Answer 1

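Remove the restriction in allowed_domains=['www.texashealth.org']: change it to allowed_domains=['texashealth.org'] or allowed_domains=['jobs.texashealth.org']. Otherwise no pages will be crawled, because the start URL is on jobs.texashealth.org, which is not covered by www.texashealth.org.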
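To see why the original setting filters everything out, here is a minimal sketch of subdomain matching in plain Python 3. It approximates the idea behind Scrapy's offsite filtering (a host passes if it equals an allowed domain or is a subdomain of one); the helper name is_allowed is hypothetical and is not part of Scrapy.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    # Optional "anything." prefix, then one of the allowed domains, to end of host.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


start_url = "http://jobs.texashealth.org/search/"
print(is_allowed(start_url, ["www.texashealth.org"]))   # False
print(is_allowed(start_url, ["texashealth.org"]))       # True
print(is_allowed(start_url, ["jobs.texashealth.org"]))  # True
```

The host jobs.texashealth.org is a sibling of www.texashealth.org, not a subdomain of it, so only the two suggested settings let the pagination links through.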

By the way, also consider changing the name of the callback function. From the docs:

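Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.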
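The effect of that warning can be illustrated with a toy model in plain Python 3. This is not Scrapy's implementation: ToyCrawlSpider, the page dictionary, and the rule tuples are all hypothetical stand-ins, chosen only to show what happens when a subclass overrides the method that carries the base class's rule handling.

```python
class ToyCrawlSpider:
    """Toy stand-in for a crawl spider: parse() itself applies the rules."""

    rules = ()  # (link_substring, callback_name) pairs, hypothetical format

    def parse(self, page):
        # In this toy model, link following lives entirely in parse().
        results = []
        for pattern, callback_name in self.rules:
            for link in page["links"]:
                if pattern in link:
                    results.append(("follow", link, callback_name))
        return results


class BrokenSpider(ToyCrawlSpider):
    rules = (("startrow=", "parse"),)

    def parse(self, page):
        # Overrides the base logic: items come out, but no links are followed.
        return [("item", title) for title in page["titles"]]


class FixedSpider(ToyCrawlSpider):
    rules = (("startrow=", "parse_items"),)

    def parse_items(self, page):
        # A differently named callback leaves the base parse() intact.
        return [("item", title) for title in page["titles"]]


page = {"titles": ["Nurse"], "links": ["/search/?startrow=25", "/about"]}
broken = BrokenSpider().parse(page)
fixed = FixedSpider().parse(page)
print(broken)
print(fixed)
```

In the spider from the question, the corresponding change would be to rename parse to something like parse_items and set callback="parse_items" in the Rule.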
