I want to use Scrapy to scrape all the "belts" listed at https://www.thingiverse.com/thing:3270948/remixes.
First, I need to build the right request. I tried:
scrapy.FormRequest(url="https://www.thingiverse.com/thing:3270948/remixes",
                   method="POST",
                   formdata={
                       'page': '7',
                       'id': '3270948'},
                   headers={
                       'x-requested-with': 'XMLHttpRequest',
                       'content-type':
                           ['application/x-www-form-urlencoded',
                            'charset=UTF-8']})
The response only contains the first page (24 belts). How do I write the right request to get the next page, or all of the belts?
Answer 0 (score: 1)
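The request payload has more parameters than the two you send. I copied all of them from the browser's Network tab; note that the POST goes to a separate AJAX URL rather than to the page URL: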
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.thingiverse.com/thing:3270948/remixes']
    ajax_url = 'https://www.thingiverse.com/ajax/things/remixes'
    # URL-encoded form body copied from the Network tab; only `page` changes
    payload = 'id=3270948&auto_scroll=true&page={}&total=153&per_page=24&last_page=7&base_url=%2Fthing%3A3270948%2Fremixes%2F&extra_path=&%24container=.results-container&source=%2Fajax%2Fthings%2Fremixes'

    def parse(self, response):
        # the first response (from start_urls) carries no meta, so it counts as page 1
        page = response.meta.get('page', 1)
        print('----')
        # just to show that content is always different, so pages are different
        print(page, response.css('div.item-header a span::text').getall()[:3])
        print('----')
        # why 7: check `last_page` param in payload
        # (checked after printing so that the last page is processed too)
        if page >= 7:
            return
        # request the next page from the AJAX endpoint
        yield scrapy.Request(
            self.ajax_url,
            method='POST',
            headers={
                'x-requested-with': 'XMLHttpRequest',
                'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
            },
            body=self.payload.format(page + 1),
            meta={'page': page + 1},
        )
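As a possible variation on the payload above, the form body can be built from a dict with the standard library's `urllib.parse.urlencode` instead of a hand-pasted string, and the `last_page` value can be derived from `total` and `per_page`. This is only a sketch: the helper names `build_payload` and `last_page_for` are made up for illustration, the parameter values are copied from the payload string in this answer, and whether the endpoint really needs every one of them has not been checked.

```python
import math
from urllib.parse import urlencode

# Values copied from the payload string in the answer; which of them the
# server actually requires is an assumption, not something verified here.
BASE_PARAMS = {
    'id': '3270948',
    'auto_scroll': 'true',
    'page': '1',  # overwritten per request
    'total': '153',
    'per_page': '24',
    'last_page': '7',
    'base_url': '/thing:3270948/remixes/',
    'extra_path': '',
    '$container': '.results-container',
    'source': '/ajax/things/remixes',
}


def build_payload(page):
    """Return the URL-encoded form body for the given page number (hypothetical helper)."""
    # Updating an existing key keeps its position, so the parameter order is unchanged.
    params = dict(BASE_PARAMS, page=page)
    return urlencode(params)


def last_page_for(total, per_page):
    """Number of pages needed to show `total` items, `per_page` at a time (hypothetical helper)."""
    return math.ceil(total / per_page)


if __name__ == '__main__':
    print(build_payload(2))
    print(last_page_for(153, 24))
```

In the spider, `body=build_payload(page + 1)` could then stand in for `self.payload.format(page + 1)`, and `last_page_for(153, 24)` gives the 7 used as the stopping page.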