I am writing a CrawlSpider with Scrapy and cannot get it to make requests with the correct links: it somehow concatenates the base URL with the entire link extracted from the page. This is an example of what I see in the log when a request goes out:

http://www.example.com/models%0D%0Ahttp://www.example.com/models/page46/info

I cannot figure out what in my code causes this. In case it matters, I need Scrapy to use cookies because the site requires authentication.
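For what it's worth, `%0D%0A` is the percent-encoding of a carriage-return/line-feed pair, which can be confirmed by decoding the logged URL:

```python
from urllib.parse import unquote

logged = "http://www.example.com/models%0D%0Ahttp://www.example.com/models/page46/info"
decoded = unquote(logged)
# the decoded string contains a literal '\r\n' between the two URLs
print(repr(decoded))
```

So the href attribute Scrapy extracted apparently contains a line break, and the joined request URL ends up holding both fragments.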
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.request import Request
from Profiles.items import Profiles


class MyCrawler(CrawlSpider):
    name = 'XSpider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/welcome/models/46/name/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]//a'),
             callback='make_requests_from_url', follow=True),
    )

    def make_requests_from_url(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
        return Request(url, cookies={'Iremoved cookiesfromThisPost': 'ignorethispart'},
                       headers=headers, dont_filter=True)

    def start_url2(self, response):
        item = Profiles()
        item['models_per_page'] = response.xpath('//div[@class="modelphoto"]/a//@href').extract()
        return item
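A possible workaround I am considering (a sketch, not a confirmed fix, since I don't know where the stray line break comes from): `LinkExtractor` accepts a `process_value` callable that runs on each raw href before Scrapy joins it with the base URL, so the CR-LF could be cleaned there. `clean_href` below is a hypothetical helper name.

```python
def clean_href(value):
    """Hypothetical cleanup hook for LinkExtractor(process_value=...).

    The raw href seems to contain a CR-LF between two URL fragments,
    so keep only the last whitespace-separated token.
    """
    if not value:
        return value
    parts = value.split()
    return parts[-1] if parts else value

# intended use inside the Rule:
# LinkExtractor(restrict_xpaths='//div[@class="pagination"]//a',
#               process_value=clean_href)
```

This assumes the real target URL is the final fragment of the attribute, which matches what the log line above shows.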