Scrapy only crawls one level of the site

Posted: 2012-02-23 03:50:14

Tags: python web-crawler scrapy

I am using Scrapy to crawl all the web pages under a domain.

I have seen this question, but it has no solution. My problem seems to be similar. The output of my crawl command looks like this:

scrapy crawl sjsu
2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines: 
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232, 'memusage/startup': 29663232}

The problem here is that the crawl finds the links on the first page but does not visit them. What is the use of a crawler like that?

EDIT:

My spider code is:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)

All my other settings are the defaults.

3 Answers:

Answer 0 (score: 10):

I think the best way is to use a CrawlSpider. You have to modify your code as follows to be able to find all the links on the first page and visit them:

# imports needed for this spider (Scrapy 0.14-era module paths)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class SjsuSpider(CrawlSpider):

    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    # allow=() is used to match all links
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)

        filename = "sjsupages"
        # open the file in binary append mode so every crawled page is added to it
        open(filename, 'ab').write(response.body)

If you want to crawl all the links in the site (and not only those on the first level), you have to add a rule that follows every link, so change the rules variable to this:

rules = [
    # one rule that both scrapes (callback) and keeps following links (follow=True);
    # if these two jobs are split into separate catch-all rules, only the first rule
    # that matches a link is applied, so the callback rule would never fire
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
]

This is why I have changed the 'parse' callback to 'parse_item':

When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

For more information, see: http://doc.scrapy.org/en/0.14/topics/spiders.html#crawlspider
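
As an aside (not part of the original answer): the log above shows DepthMiddleware enabled, and Scrapy's DEPTH_LIMIT setting controls how many levels of links a spider follows. The default of 0 means unlimited, so nothing needs to change to crawl the whole site, but a cap can be set in the project's settings.py if you ever want one. A minimal sketch:

# settings.py of the sjsucrawler project (project name taken from the log above)

# DEPTH_LIMIT is read by DepthMiddleware; 0 (the default) means no limit.
# Setting it to 3, for example, stops the spider three link-levels from the start URLs.
DEPTH_LIMIT = 3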

Answer 1 (score: 2):

If you are using a BaseSpider, then inside your parse method/callback you need to extract the URLs you want and return Request objects for them if you intend to visit those pages:

for url in hxs.select('//a/@href').extract():
    yield Request(url, callback=self.parse)

All parse does is receive the response; you have to tell it what to do with that response. This is stated in the docs.
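
To make that concrete, here is a minimal sketch of how the spider from the question could be rewritten along these lines. The urljoin call and the append mode are additions here, since extracted hrefs may be relative and parse now runs for every page, not just the first:

from urlparse import urljoin  # Python 2 stdlib, matching the Scrapy 0.14 era above

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]

    def parse(self, response):
        # do whatever you want with the current page, e.g. append its body to a file
        open("sjsupages", 'ab').write(response.body)

        # then queue every link found on this page, using the same callback;
        # urljoin turns relative hrefs into absolute URLs Scrapy can request
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            yield Request(urljoin(response.url, url), callback=self.parse)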

Alternatively, if you want to use a CrawlSpider, you just have to define rules for the spider.

Answer 2 (score: 0):

In case this is useful: when the crawler does not work in a case like this, make sure you delete the following code from your spider file. This is because the spider is configured to call this method by default if it is declared in the file:

def parse(self, response):
    # an empty parse() like this overrides CrawlSpider.parse and disables the rules
    pass
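
For illustration only (a hypothetical spider, not code from the original post), this is the kind of file the answer is warning about. The stub parse shadows CrawlSpider.parse, so the rules are never applied:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class BrokenSjsuSpider(CrawlSpider):  # hypothetical name, for illustration
    name = "broken_sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = ["http://cs.sjsu.edu/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse(self, response):
        # this stub overrides CrawlSpider.parse, which implements the rule logic,
        # so no links are followed; delete this method to make the rules work
        pass

    def parse_item(self, response):
        open("sjsupages", 'ab').write(response.body)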