Python LinkExtractor going to the next page doesn't work

Asked: 2016-02-08 15:16:43

Tags: python scrapy web-crawler scrapy-spider

Below is the code with which I'm trying to scrape a website that spans more than one page... I'm having a hard time getting the Rule class to work properly. What am I doing wrong?

#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

#    def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article['+str(x)+']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            #handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
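
For reference, the code imports SkodaItem from tutorial.items, which is not shown in the question. A minimal definition consistent with the fields used in the loop might look like this — only the field names are taken from the code above, the rest is an assumed standard Scrapy item module:

# tutorial/items.py -- hypothetical sketch; only the field names come from the spider above
import scrapy

class SkodaItem(scrapy.Item):
    title = scrapy.Field()     # listing title
    leeftijd = scrapy.Field()  # age / year of the car
    prijs = scrapy.Field()     # asking price
    km = scrapy.Field()        # mileage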

1 Answer:

Answer 0 (score: 0)

A few things to change:

  • When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic; if you override the parse method, the crawl spider will no longer work.
  • As mentioned in the comments, your XPath needs to be fixed by removing the extra /a at the end (an <a> nested inside an <a> will not match anything).
  • CrawlSpider rules need a callback method if you want to extract items from the pages that are followed.
  • To also parse elements from the start URLs, you need to define a parse_start_url method.

Here is a minimal CrawlSpider that follows the 3 pages from your sample input and prints out how many "articles" each page has:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page

Output:

$ scrapy runspider 001.py 
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines: 
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=3&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 96682,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'request_depth_max': 2,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
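
If the goal is to extract the listing fields on every page rather than just count articles, the same structure can yield items. Below is a sketch that combines the spider above with the question's XPaths, rewritten relative to each <article> node (the leading "./" scopes them to the current article, which also removes the need for the x counter). The selectors come from the question and may still need adjusting to the live markup; the spider name is arbitrary:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaItemSpider(CrawlSpider):
    name = "skodas_items"  # hypothetical name, pick whatever fits your project
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # one iteration per listing; relative XPaths ("./...") are scoped to the current <article>
        for article in response.xpath('//*[@id="search-results"]/section[2]/article'):
            item = SkodaItem()
            item["title"] = article.xpath('./div/div[1]/div[1]/h2/a/span/text()').extract_first()
            item["leeftijd"] = article.xpath('./div/div[1]/div[2]/span[1]/text()').extract_first()
            item["prijs"] = article.xpath('./div/div[2]/div[1]/div/div/text()').extract_first()
            item["km"] = article.xpath('./div/div[1]/div[2]/span[3]/text()').extract_first()
            yield item

    # make sure the first page is parsed too
    parse_start_url = parse_page

Run it the same way (scrapy runspider), optionally with -o items.json to write the scraped items to a file.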