I am trying to scrape this page: https://coinmarketcap.com/currencies/views/all/
In every row, td[2] contains a link. I am trying to get Scrapy to follow each link in that td and scrape the page the link points to. Here is my code:
Note: another person was a great help in getting me this far.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import BTCItem  # BTCItem is defined in the project's items.py


class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('//td[2]/a',)), callback="parse", follow=True),
    )

    def parse(self, response):
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4],
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract(),
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract(),
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract(),
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield (BTC)
Even though the table contains more than 600 links/pages, when I run scrapy crawl coinmarketcap I only get 19 records. In other words, only 19 of the 600+ pages are scraped. I can't see what is causing the crawl to stop. Any help would be greatly appreciated.
Thanks.
Answer 0 (score: 1)
Your spider is going too deep: with that rule it also finds and follows links inside the individual coin pages. You can roughly fix the problem by adding DEPTH_LIMIT = 1, but you can surely find a more elegant solution. Here is the code that works for me (with a couple of other small adjustments):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import BTCItem  # BTCItem is defined in the project's items.py


class ToScrapeSpiderXPath(CrawlSpider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    # Without a depth limit the spider also follows links found on the
    # individual coin pages; limiting the depth keeps it on the first hop.
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }

    rules = (
        # Follow every link inside the second cell of each table row and hand
        # the response to parse_item (never override parse() in a CrawlSpider).
        Rule(LinkExtractor(restrict_xpaths=('//td[2]',)), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        BTC = BTCItem()
        # e.g. https://coinmarketcap.com/currencies/bitcoin/ ->
        # source: coinmarketcap.com, asset: bitcoin
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4]
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract()
        BTC['asset_price_change'] = response.xpath(
            '/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract()
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract()
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield BTC
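Both spiders assume a BTCItem class in the project's items.py, which isn't shown in the question. A minimal sketch of what it might look like, with the field names taken from the spider above (the file layout and comments are assumptions about your project, not something from the original post):

# items.py -- hypothetical item definition matching the fields used in the spider
import scrapy

class BTCItem(scrapy.Item):
    source = scrapy.Field()              # host part of the crawled URL
    asset = scrapy.Field()               # coin slug from the URL path
    asset_price = scrapy.Field()         # price text from #quote_price
    asset_price_change = scrapy.Field()  # 24h price change
    BTC_price = scrapy.Field()           # price denominated in BTC
    Prct_change = scrapy.Field()         # percentage change

With the item defined you can run the spider and dump the results to a file, for example: scrapy crawl coinmarketcap -o coins.json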