Request URL contains %0D%0A

Time: 2016-07-11 00:41:38

Tags: url hyperlink scrapy double

I am writing a CrawlSpider with Scrapy and cannot get it to make requests with the correct links. Somehow it concatenates the base URL with the full link extracted from the site. Here is an example of what I see in the log when a request is made:

http://www.example.com/models%0D%0Ahttp://www.example.com/models/page46/info
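%0D%0A is the percent-encoding of a carriage return and line feed, so the extracted href apparently still contains a CR/LF in front of the second URL. A minimal sketch (the raw_href value below is an assumption, not taken from the actual site) showing how such a value ends up looking exactly like the logged URL once it is escaped with safe_url_string from w3lib, a dependency Scrapy uses for URL handling:

from w3lib.url import safe_url_string

# Hypothetical raw attribute value: two URLs separated by a stray CR/LF.
raw_href = "http://www.example.com/models\r\nhttp://www.example.com/models/page46/info"

# safe_url_string percent-encodes characters that are not allowed in a URL,
# so the CR/LF pair becomes %0D%0A, matching the request seen in the log.
print(safe_url_string(raw_href))
# http://www.example.com/models%0D%0Ahttp://www.example.com/models/page46/info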

I can't understand why it is doing this. Below is the code, in case it helps. I need Scrapy to send cookies because the site uses authentication.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.request import Request
from Profiles.items import Profiles


class MyCrawler(CrawlSpider):
    name = 'XSpider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com/welcome/models/46/name/']


    rules = (
        # Follow pagination links; each extracted link is turned into a
        # request by make_requests_from_url below.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]//a'),
             callback='make_requests_from_url', follow=True),
    )

    def make_requests_from_url(self, url):
        # Send every request with a desktop User-Agent and the session
        # cookies needed for the authenticated part of the site.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
        return Request(url, cookies={'Iremoved cookiesfromThisPost': 'ignorethispart'}, headers=headers, dont_filter=True)


    def start_url2(self, response):
        # Collect the profile links shown on each paginated model listing.
        item = Profiles()
        item['models_per_page'] = response.xpath('//div[@class="modelphoto"]/a//@href').extract()
        return item
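One way to see which raw href produces the malformed URL is LinkExtractor's process_value argument, which receives every extracted attribute value before it is turned into a request. A minimal sketch (the clean_link helper is hypothetical, not part of the spider above) of how it could be wired into the rule:

import logging

def clean_link(value):
    # Log the raw href so stray carriage returns / line feeds are visible,
    # then strip surrounding whitespace before the link becomes a request.
    logging.debug("raw href: %r", value)
    return value.strip()

# Hypothetical use inside the rule definition:
# Rule(LinkExtractor(restrict_xpaths='//div[@class="pagination"]//a',
#                    process_value=clean_link),
#      callback='make_requests_from_url', follow=True)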

0 Answers:

No answers yet.