Question

I am writing a crawler to get the names of items from a website. The site shows 25 items per page and has multiple pages (200 for some item types).

Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from lonelyplanet.items import LonelyplanetItem

class LonelyplanetSpider(CrawlSpider):
    name = "lonelyplanetItemName_spider"
    allowed_domains = ["lonelyplanet.com"]
    def start_requests(self):
        for i in xrange(8):
            yield self.make_requests_from_url("http://www.lonelyplanet.com/europe/sights?page=%d" % i)


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h2')
        items = []
        for site in sites:
            item = LonelyplanetItem()
            item['name'] = site.select('a[@class="targetUrl"]/text()').extract()
            items.append(item)
        return items

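When I run the crawler and store the data in CSV format, the data is not stored in order: page 2's data is stored before page 1's, or page 3's before page 2's, and so on. Sometimes, before all the data from one page has been stored, data from another page comes in, and only then is the rest of the earlier page's data stored.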

Answer 1

Scrapy is an asynchronous framework. It uses non-blocking IO, so it does not wait for one request to finish before starting the next.

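Because several requests can be in flight at once, there is no way to know the exact order in which the parse() method will receive the responses.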
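To make the effect concrete, here is a minimal sketch that uses Python's standard asyncio module rather than Scrapy itself (Scrapy is built on Twisted, so this is only an analogy). The `fetch` and `crawl` functions and the latencies are made up for illustration, and `asyncio.sleep` stands in for network delay.

```python
import asyncio


async def fetch(page, delay, arrivals):
    # asyncio.sleep simulates network latency; no real HTTP request is made.
    await asyncio.sleep(delay)
    # Record the page at the moment its "response" arrives.
    arrivals.append(page)


async def crawl():
    arrivals = []
    # Hypothetical latencies: page 1 is the slowest, page 3 the fastest.
    delays = {1: 0.3, 2: 0.2, 3: 0.1}
    # All three "requests" are started before any of them finishes.
    await asyncio.gather(*(fetch(p, d, arrivals) for p, d in delays.items()))
    return arrivals


if __name__ == "__main__":
    # Pages are requested as 1, 2, 3 but arrive in order of latency.
    print(asyncio.run(crawl()))
```

The requests are issued in page order, yet the arrival list reflects whichever response came back first, which is the same reason the CSV rows end up interleaved.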

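My point is that Scrapy is not meant to extract data in any particular order. If you absolutely need to preserve the order, there are some ideas here: Scrapy Crawl URLs in Order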
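One further option, offered as an illustration and not taken from the linked thread, is to let the crawl run out of order and restore the order afterwards. The sketch below assumes each scraped row carries two extra fields, `page` and `position`, which the question's LonelyplanetItem does not define. The helpers `page_number` and `sort_rows` are hypothetical names, and the row names are placeholders, not real scraped data.

```python
from urllib.parse import urlparse, parse_qs


def page_number(url):
    """Return the integer 'page' query parameter of url, or 0 if it is absent."""
    query = parse_qs(urlparse(url).query)
    return int(query.get("page", ["0"])[0])


def sort_rows(rows):
    """Order rows by page, then by their position within that page."""
    return sorted(rows, key=lambda row: (row["page"], row["position"]))


# Rows in the order an asynchronous crawl might deliver them.
rows = [
    {"page": 2, "position": 0, "name": "C"},
    {"page": 1, "position": 1, "name": "B"},
    {"page": 2, "position": 1, "name": "D"},
    {"page": 1, "position": 0, "name": "A"},
]

if __name__ == "__main__":
    for row in sort_rows(rows):
        print(row["page"], row["position"], row["name"])
```

In the spider, parse() could fill these fields with something like `page_number(response.url)` and the index from `enumerate(sites)`, and the sort could run once the crawl has finished, before the CSV is written.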

Scrapy does not crawl subsequent pages in order
