Scrapy: multiple form requests fail without any error

Asked: 2016-04-12 20:03:41

Tags: python scrapy web-crawler scrapy-spider

I've written a spider that starts with multiple form requests, each one varying a single parameter, like this:

import scrapy
import MyData  # the project's item module (its MyData class is used in __parse_single_item)

class TestSpider(scrapy.Spider):
    name = "Test Spider"

    allowed_domains = ["somewhere.com"]

    PARAMS = [
        'FOO',
        'BAR',
        # ... (remaining values elided; the full list has 21 entries)
    ]

    def start_requests(self):
        for param in self.PARAMS:
            yield scrapy.FormRequest(
                "http://somewhere.com/search.do?action=advanced",
                formdata={
                    'param': param
                },
                callback=self.__first_page,
                meta={'cookiejar': 'initial-session', 'param': param}
            )

    def __first_page(self, response):
        # recover the varying parameter passed along in meta
        param = response.meta['param']
        yield scrapy.FormRequest(
            "http://somewhere.com/advancedSearch.do?action=firstPage",
            formdata={
                'param': param
            },
            callback=self._first_page_search,
            meta={'cookiejar': response.meta['cookiejar']}
        )

    def _first_page_search(self, response):
        yield self.__applications_page(response, '1')

    def __applications_page(self, response, page):
        return scrapy.FormRequest(
            "http://somewhere.com/pagedSearch.do",
            formdata={
                'searchCriteria.page': page
            },
            callback=self.__parse_data,
            meta={'cookiejar': response.meta['cookiejar'], 'page': page}
        )

    def __parse_data(self, response):
        page = response.meta['page']
        if page == '1':
            number_of_pages = len(response.xpath('//div[@id="searchResultsContainer"]/p[@class="pager top"]/a[@class="page"]/@href').extract())
            print "NUMBER OF PAGES",number_of_pages
            for p in range(2, number_of_pages + 2):
                yield self.__applications_page(response, str(p))

        links = response.xpath('//ul[@id="searchresults"]//li[@class="searchresult"]/a/@href').extract()
        for url in links:
            yield scrapy.Request(url='http://somewhere.com' + url, callback=self.__parse_single_item)

    def __parse_single_item(self, response):
        item = MyData.MyData()
        # ... (field population elided)
        return item
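
One structural point stands out once several of these search flows run at the same time: every `pagedSearch.do` request for a given page number posts the same URL, the same method, and the same form body regardless of which parameter it belongs to, and that combination is exactly what Scrapy's duplicate filter fingerprints requests on. Below is a minimal sketch of `__applications_page` with the filter bypassed, assuming the rest of the spider above is unchanged (`dont_filter=True` is a stock `scrapy.Request`/`FormRequest` argument; that the filter matters here is only a hypothesis, not something the post confirms):

    # Sketch only: same method as above, with the scheduler's duplicate
    # filter bypassed for the paged POSTs.
    def __applications_page(self, response, page):
        return scrapy.FormRequest(
            "http://somewhere.com/pagedSearch.do",
            formdata={
                'searchCriteria.page': page
            },
            callback=self.__parse_data,
            meta={'cookiejar': response.meta['cookiejar'], 'page': page},
            dont_filter=True,  # never drop this request as a duplicate
        )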

When there is only one item in PARAMS to start with, the crawl works fine. So if I run it against just FOO, or just BAR, and so on, I get the correct scraped data. When I use the full set of parameters (21 in the complete list), the process fails: it looks as if only 2/3 of the parameters are ever attempted, and the process then exits without any error (no 500s, only 200s, and no generic Python errors).
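
Requests that vanish without any error are usually requests Scrapy dropped silently rather than requests that failed. Here is a small diagnostic sketch, assuming stock Scrapy: `custom_settings` and the `DUPEFILTER_DEBUG` setting are standard Scrapy features (the latter makes the duplicate filter log every request it discards, not just the first one); the values shown are illustrative.

    import scrapy

    class TestSpider(scrapy.Spider):
        name = "Test Spider"
        # Surface silent drops: log every request the duplicate filter
        # discards, and keep logging verbose enough to see scheduler decisions.
        custom_settings = {
            'DUPEFILTER_DEBUG': True,
            'LOG_LEVEL': 'DEBUG',
        }
        # ... rest of the spider as above

The stats summary printed when the crawl closes also reports a `dupefilter/filtered` counter whenever requests were discarded this way.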

Is there a mistake in the way I've written the spider?

0 Answers:

No answers yet.