Scrapy ignores a specific domain

Date: 2017-01-31 14:20:23

Tags: python python-2.7 request scrapy

I am trying to scrape the forum categories of craigslist.org (https://forums.craigslist.org/). My spider:

import scrapy
from scrapy import Request


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                      self.getForumPage,
                      dont_filter=True,
                      errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"

In the error callback I receive this message:

[Failure instance: Traceback:
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
  --- <exception caught here> ---
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
  /usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
  /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
  /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
  /usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
]

But I only have this problem with the forum section of Craigslist. It may be because the forum section uses https, unlike the rest of the site, so I can never get a response...
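Incidentally, printing the Failure object directly hides the underlying exception. Here is a minimal sketch of a more informative errback (assuming the standard Twisted Failure that Scrapy passes to errbacks; the traceback above ends in the robots.txt middleware, which raises IgnoreRequest when it blocks a request):

from scrapy.exceptions import IgnoreRequest

def error_handler(self, failure):
    # failure.value is the real exception wrapped inside the Failure
    print repr(failure.value)
    # The robots.txt downloader middleware raises IgnoreRequest for
    # blocked URLs; check() tests the type of the wrapped exception
    if failure.check(IgnoreRequest):
        print "request was blocked before reaching the downloader"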

Any ideas?

2 answers:

Answer 0 (score: 0)

I'm posting the solution I used to work around the problem.

I used the urllib2 library. See:

import urllib2

import scrapy
from scrapy.http import HtmlResponse


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        # Build the request with urllib2 instead of Scrapy's downloader
        req = urllib2.Request('https://forums.craigslist.org/')
        # Fetch the page content directly
        pageContent = urllib2.urlopen(req).read()
        # Wrap the body in an HtmlResponse so Scrapy selectors work on it
        response = HtmlResponse(url=response.url, body=pageContent)
        print response.css(".forumlistcolumns li").extract()

With this solution you can fetch the page outside Scrapy and still parse it as a valid Scrapy response, so the usual selectors work on it. There is probably a better way, but this one does the job.
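A note on the design choice: fetching with urllib2 bypasses Scrapy's downloader entirely, so the request skips its middlewares, cookies, and throttling. If robots.txt is what blocks the forum URL, recent Scrapy versions (roughly 1.1 and later; worth verifying against your install) honor a per-request meta key that tells the robots.txt middleware to skip the check while keeping the request inside Scrapy. A sketch, inside the spider:

from scrapy import Request

def parse(self, response):
    # The robots.txt downloader middleware returns early for requests
    # carrying this meta key (verify it exists in your Scrapy version)
    yield Request('https://forums.craigslist.org/',
                  callback=self.getForumPage,
                  errback=self.error_handler,
                  meta={'dont_obey_robotstxt': True})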

Answer 1 (score: 0)

I think you are running into robots.txt. Try running the spider with:

custom_settings = {
    "ROBOTSTXT_OBEY": False
}

You can also test it with a command-line setting: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False
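For completeness, a minimal sketch of where custom_settings sits in the spider from the question; the override applies to this spider only, taking precedence over the project-wide settings:

import scrapy
from scrapy import Request


class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    # Per-spider override: stop the robots.txt middleware from
    # dropping requests to the forum section
    custom_settings = {
        "ROBOTSTXT_OBEY": False
    }

    def parse(self, response):
        yield Request('https://forums.craigslist.org/', self.getForumPage)

    def getForumPage(self, response):
        print "forum page"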