I wrote a spider that starts by issuing several FormRequests, one for each value of a varying parameter, like this:
    import scrapy


    class TestSpider(scrapy.Spider):
        name = "Test Spider"
        allowed_domains = ["somewhere.com"]

        PARAMS = [
            'FOO',
            'BAR',
            # etc.
        ]

        def start_requests(self):
            # One initial search request per parameter
            for param in self.PARAMS:
                yield scrapy.FormRequest(
                    "http://somewhere.com/search.do?action=advanced",
                    formdata={
                        'param': param
                    },
                    callback=self.__first_page,
                    meta={'cookiejar': 'initial-session', 'param': param}
                )

        def __first_page(self, response):
            yield scrapy.FormRequest(
                "http://somewhere.com/advancedSearch.do?action=firstPage",
                formdata={
                    # param is carried over from start_requests via meta
                    'param': response.meta['param']
                },
                callback=self._first_page_search,
                meta={'cookiejar': response.meta['cookiejar']}
            )

        def _first_page_search(self, response):
            yield self.__applications_page(response, '1')

        def __applications_page(self, response, page):
            return scrapy.FormRequest(
                "http://somewhere.com/pagedSearch.do",
                formdata={
                    'searchCriteria.page': page
                },
                callback=self.__parse_data,
                meta={'cookiejar': response.meta['cookiejar'], 'page': page}
            )

        def __parse_data(self, response):
            page = response.meta['page']
            if page == '1':
                # Count the pager links on the first results page
                number_of_pages = len(response.xpath(
                    '//div[@id="searchResultsContainer"]'
                    '/p[@class="pager top"]/a[@class="page"]/@href'
                ).extract())
                print("NUMBER OF PAGES %d" % number_of_pages)
                for p in range(2, number_of_pages + 2):
                    yield self.__applications_page(response, str(p))
            links = response.xpath(
                '//ul[@id="searchresults"]//li[@class="searchresult"]/a/@href'
            ).extract()
            for url in links:
                yield scrapy.Request(
                    url='http://somewhere.com' + url,
                    callback=self.__parse_single_item
                )

        def __parse_single_item(self, response):
            item = MyData.MyData()
            # …
            return item
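
When PARAMS contains only a single item, the scrape works fine. So if I run it for just FOO, or just BAR, and so on, I get correctly scraped data. When I include all the parameters (the full list has 21), the process fails: it looks like only 2/3 parameters are ever attempted, and the process exits without any error (no 500s, only 200s, and no general Python errors).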
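
To pin down which parameters actually make it through each callback, a small tally along these lines might help. This is only a sketch in plain Python: `StageTally`, the stage names, and the extra value 'BAZ' are hypothetical, and in the spider `mark` would have to be called from each callback with `response.meta['param']`, which assumes 'param' is passed along in `meta` on every request (the code above only sets it on the first one).

```python
from collections import defaultdict


class StageTally:
    """Hypothetical helper: records which params were seen at each stage."""

    def __init__(self, expected):
        self.expected = list(expected)
        self.seen = defaultdict(set)

    def mark(self, stage, param):
        # Call this at the top of a callback, e.g. mark('first_page', param)
        self.seen[stage].add(param)

    def missing(self, stage):
        # Params that never reached the given stage, in their original order
        return [p for p in self.expected if p not in self.seen[stage]]


# Stand-in data; 'BAZ' is made up for illustration
tally = StageTally(['FOO', 'BAR', 'BAZ'])
tally.mark('first_page', 'FOO')
tally.mark('first_page', 'BAR')
tally.mark('parse_data', 'FOO')

print(tally.missing('first_page'))
print(tally.missing('parse_data'))
```

Logging `missing(...)` for each stage when the spider closes would show at which step the remaining parameters drop out.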

Is there something wrong with the way I wrote the spider?