Question

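I'm trying to scrape the Reuters search results page. It loads its results via JavaScript, as described in this question.

I changed numResultsToShow to values above 2000, such as 9999. The search has more than 45,000 results in total, but no matter what number I set, Scrapy returns only 5,000 scraped items.

My code is as follows: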

class ReutersSpider(scrapy.Spider):
    name = "reuters"
    start_urls = [
        'https://www.reuters.com/assets/searchArticleLoadMoreJson?blob=steel.&bigOrSmall=big&articleWithBlog=true&sortBy=&dateRange=&numResultsToShow=9999&pn=1&callback=addMoreNewsResults',
    ]

    def parse(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search( r'addMoreNewsResults\((.+?) \);', html, re.DOTALL ).group(1)

        #Below code is used to transform from Javascript-ish JSON-like structure to JSON
        json_string = re.sub( r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub( r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub( r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)

        results = json.loads(json_string)

        for result in results["news"]:
            item = ReuterItem()
            item["href"] = result["href"]
            item["date"] = result["date"]
            item["headline"] = result["headline"]
            yield item

How can I increase this so that it covers all of the search results?

Answer 1

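There are more than a few things to consider when crawling sites like this, and even more when you use their internal APIs. Here are some suggestions from my experience, in no particular order: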

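Since you will probably be making a lot of requests while changing the query parameters, it is good practice to build the URLs dynamically so you don't go crazy.
Always try to strip as much boilerplate from your requests as possible, such as extra query parameters and headers. Playing with the API in a tool like Postman is a useful way to arrive at the minimum working request.
As the spider grows more complicated and/or the crawling logic becomes more complex, it helps to extract related code into separate methods for reusability and easier maintenance.
You can pass useful information in the meta of a request, and it will be copied to the meta of the response. In this example that is handy for keeping track of the page currently being crawled. Alternatively, you can extract the page number from the URL, which is more robust.
Consider whether you need any cookies to access a given page. Without the proper cookies you may not get a response directly from the API (or from any other page, for that matter). Usually it is enough to visit the main page first, and Scrapy will take care of storing the cookies.
Always be polite, both to avoid being banned and to avoid putting a lot of stress on the target site. Use a high download delay if possible, and keep concurrency low.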
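
To make the first and fourth points concrete, here is a minimal standalone sketch of building the search URL from a dict of parameters and reading the page number back out of a URL. The helper names `build_url` and `page_from_url` are made up for illustration, the parameter names are copied from the URL in the question, and Python 3 is assumed.

```python
from urllib.parse import parse_qs, urlencode, urlparse

SEARCH_URL = 'https://www.reuters.com/assets/searchArticleLoadMoreJson?'


def build_url(blob, page, per_page=1000):
    """Build the search URL for one page of results (hypothetical helper)."""
    params = {
        'blob': blob,
        'bigOrSmall': 'big',
        'articleWithBlog': 'true',
        'callback': 'addMoreNewsResults',
        'numResultsToShow': per_page,
        'pn': page,
    }
    return SEARCH_URL + urlencode(params)


def page_from_url(url):
    """Read the page number back from the 'pn' query parameter."""
    query = parse_qs(urlparse(url).query)
    return int(query['pn'][0])
```

Inside a spider callback, something like `page_from_url(response.url)` could then stand in for `response.meta['page']`, provided the request was not redirected to a URL without the `pn` parameter.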

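With all that said, I gave it a quick run and put together a semi-working example, which should be enough to get you started. There is still room for improvement, such as more sophisticated retry logic and revisiting the main page if the cookies expire.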

# -*- coding: utf-8 -*-

import json
import re
from urllib.parse import urlencode

import scrapy

class ReuterItem(scrapy.Item):
    href = scrapy.Field()
    date = scrapy.Field()
    headline = scrapy.Field()

class ReutersSpider(scrapy.Spider):
    name = "reuters"
    NEWS_URL = 'https://www.reuters.com/search/news?blob={}'
    SEARCH_URL = 'https://www.reuters.com/assets/searchArticleLoadMoreJson?'
    RESULTS_PER_PAGE = 1000
    BLOB = 'steel.'

    custom_settings = {
        # blend in
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0)'
                       ' Gecko/20100101 Firefox/40.1'),
        # be polite
        'DOWNLOAD_DELAY': 5,
    }

    def _build_url(self, page):
        params = {
            'blob': self.BLOB,
            'bigOrSmall': 'big',
            'callback': 'addMoreNewsResults',
            # lowercase string, to match the URL in the question
            # (urlencode would render the boolean True as "True")
            'articleWithBlog': 'true',
            'numResultsToShow': self.RESULTS_PER_PAGE,
            'pn': page
        }
        return self.SEARCH_URL + urlencode(params)

    def _parse_page(self, response):
        html = response.body.decode('utf-8')
        json_string = re.search( r'addMoreNewsResults\((.+?) \);', html, re.DOTALL ).group(1)
        #Below code is used to transform from Javascript-ish JSON-like structure to JSON
        json_string = re.sub( r'^\s*(\w+):', r'"\1":', json_string, flags=re.MULTILINE)
        json_string = re.sub( r'(\w+),\s*$', r'"\1",', json_string, flags=re.MULTILINE)
        json_string = re.sub( r':\s*\'(.+?)\',\s*$', r': "\1",', json_string, flags=re.MULTILINE)
        return json.loads(json_string)

    def start_requests(self):
        # visit the news page first to get the cookies needed
        # to visit the API in the next steps
        url = self.NEWS_URL.format(self.BLOB)
        yield scrapy.Request(url, callback=self.start_crawl)

    def start_crawl(self, response):
        # now that we have cookies set,
        # start crawling from the first page
        yield scrapy.Request(self._build_url(1), meta=dict(page=1))

    def parse(self, response):
        data = self._parse_page(response)

        # extract news from the current page
        for item in self._parse_news(data):
            yield item

        # Paginate if needed
        current_page = response.meta['page']
        total_results = int(data['totalResultNumber'])
        if total_results > (current_page * self.RESULTS_PER_PAGE):
            page = current_page + 1
            url = self._build_url(page)
            yield scrapy.Request(url, meta=dict(page=page))

    def _parse_news(self, data):
        for article in data["news"]:
            item = ReuterItem()
            item["href"] = article["href"]
            item["date"] = article["date"]
            item["headline"] = article["headline"]
            yield item
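
For anyone wondering what the three `re.sub` calls in `_parse_page` actually do, here is a small standalone sketch that runs the same regular expressions on a made-up payload. The `SAMPLE` text is an assumption modelled on the shape those regexes expect, not a real Reuters response, and `jsonp_to_dict` is a hypothetical name.

```python
import json
import re

# Made-up payload for illustration only; a real response has more fields.
SAMPLE = """addMoreNewsResults( {
    blob: 'steel.',
    totalResultNumber: 45123,
    news: [
        {
            id: "a1",
            headline: "Steel prices rise",
            date: "May 1, 2018",
            href: "/article/a1"
        }
    ]
} );"""


def jsonp_to_dict(text):
    """Strip the JSONP callback wrapper and patch the body into valid JSON."""
    match = re.search(r'addMoreNewsResults\((.+?) \);', text, re.DOTALL)
    if match is None:
        raise ValueError('callback wrapper not found')
    body = match.group(1)
    # 1) quote bare keys at the start of a line:    blob:  -> "blob":
    body = re.sub(r'^\s*(\w+):', r'"\1":', body, flags=re.MULTILINE)
    # 2) quote bare values at the end of a line:    45123, -> "45123",
    body = re.sub(r'(\w+),\s*$', r'"\1",', body, flags=re.MULTILINE)
    # 3) turn single-quoted values into double-quoted ones
    body = re.sub(r':\s*\'(.+?)\',\s*$', r': "\1",', body, flags=re.MULTILINE)
    return json.loads(body)


data = jsonp_to_dict(SAMPLE)
```

Note that step 2 turns bare numbers into strings, which is why `parse` wraps `data['totalResultNumber']` in `int()`. The approach depends on the payload keeping one key per line, so it may need adjusting if the response format changes.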
