How to set priority order

Time: 2019-06-12 15:38:14

Tags: python scrapy splash-screen scrapy-splash

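I am trying to scrape web pages and need to set priorities so that they are scraped in a particular order. Right now the spider scrapes page 1 of every URL, then page 2 of every URL, and so on. What I need is for it to scrape all pages of URL 1, then all pages of URL 2, and so on. I have been trying to do this with priorities by giving the first URL the highest priority, which would be the number of URLs in the csv file. It does not work, mainly because I cannot decrement the priority value: it is set inside a for loop, so every pass through the loop resets it to the original number and every request ends up with the same priority. How can I get priorities to work so that the URLs are scraped in the order I want?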

SplashSpider.py

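from scrapy import Spider
from scrapy_splash import SplashRequest

# process_csv is the asker's own helper; its definition is not shown here


class MySpider(Spider):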

    # Name of Spider
    name = 'splash_spider'
    # getting all the url + ip address + useragent pairs then request them
    def start_requests(self):


        # get the file path of the csv file that contains the pairs from the settings.py
        with open(self.settings["PROXY_CSV_FILE"], mode="r") as csv_file:
            # requests is a list of dictionaries like this -> {url: str, ua: str, ip: str}
            requests = process_csv(csv_file)
            for i, req in enumerate(requests):
                x = len(requests) - i  # <- check here
                # Return needed url with set delay of 3 seconds
                yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                        # Pair with user agent specified in csv file
                        headers={"User-Agent": req["ua"]},
                        # Sets splash_url to whatever the current proxy that goes with current URL  is instead of actual splash url
                        splash_url = req["ip"],
                        priority = x,
                        meta={'priority': x}
                        )



Update #1

    # Scraping function that will scrape URLs for specified information
    def parse(self, response):
        # Initialize item to function GameItem located in items.py, will be called multiple times
        item = GameItem()
        # Initialize saved_name
        saved_name = ""
        # Extract card category from URL using html code from website that identifies the category.  Will be outputted before rest of data
        item["Category"] = response.css("span.titletext::text").get()
        # For loop to loop through HTML code until all necessary data has been scraped
        for game in response.css("tr[class^=deckdbbody]"):
            # Initialize saved_name to the extracted card name
            saved_name  = game.css("a.card_popup::text").get() or saved_name
            # Now call item and set equal to saved_name and strip leading '\n' from output
            item["Card_Name"] = saved_name.strip()
            # Check to see if output is null, in the case that there are two different conditions for one card
            if item["Card_Name"] is not None:
                # If not null then store value in saved_name
                saved_name = item["Card_Name"].strip()
            # If null then set null value to previous card name since if there is a null value you should have the same card name twice
            else:
                item["Card_Name"] = saved_name
            # Call item again in order to extract the condition, stock, and price using the corresponding html code from the website
            item["Condition"] = game.css("td[class^=deckdbbody].search_results_7 a::text").get()
            item["Stock"] = game.css("td[class^=deckdbbody].search_results_8::text").get()
            item["Price"] = game.css("td[class^=deckdbbody].search_results_9::text").get()
            if item["Price"] is None:
                item["Price"] = game.css("td[class^=deckdbbody].search_results_9 span[style*='color:red']::text").get()

            # Return values
            yield item


        priority = response.meta['priority']
        # Finds next page button
        next_page = response.xpath('//a[contains(., "- Next>>")]/@href').get()
        # If it exists and there is a next page enter if statement
        if next_page is not None:
            # Go to next page
            yield response.follow(next_page, self.parse, priority=priority, meta={'priority': priority})

Update #2

2019-06-13 15:16:23 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.starcitygames.com/catalog/category/1014?&start=50> (referer: http://www.starcitygames.com/catalog/category/Visions)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/home/north/scrapy_splash/scrapy_javascript/scrapy_javascript/spiders/SplashSpider.py", line 104, in parse
    priority = response.meta['priority']
KeyError: 'priority'

1 answer:

Answer 0 (score: 1)

To vary the priority across the list of requests, it is better to derive it from the loop index, like this:

    for i, req in enumerate(requests):
        x = len(requests) - i  # <- check here

        # Return needed url with set delay of 3 seconds
        yield SplashRequest(url=req["url"], callback=self.parse, args={"wait": 3},
                # Pair with user agent specified in csv file
                headers={"User-Agent": req["ua"]},
                # Sets splash_url to whatever the current proxy that goes with current URL  is instead of actual splash url
                splash_url = req["ip"],
                priority = x,
                meta={'priority': x}  # <- check here!!
                )

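Also, don't forget to pass the current priority along via meta (I don't remember whether it can be read from the response) so that it reaches every child request.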
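Why the priority has to travel with each follow-up request can be seen in a small model. The sketch below is not Scrapy: it stands in for the scheduler with a plain priority queue that handles one request at a time, and `crawl_order`, `pages_per_url`, and `propagate_priority` are hypothetical names made up for this illustration. It assumes that a higher priority value is served first and that a request created without an explicit priority gets 0, which is Scrapy's default. Ties are broken first-in, first-out here, which is a simplification.

```python
import heapq
from itertools import count


def crawl_order(start_urls, pages_per_url, propagate_priority=True):
    """Simulate a scheduler that always pops the highest-priority request."""
    tie_breaker = count()  # keeps insertion order among equal priorities
    queue = []
    for i, url in enumerate(start_urls):
        priority = len(start_urls) - i  # first URL gets the highest value
        # heapq is a min-heap, so the priority is stored negated
        heapq.heappush(queue, (-priority, next(tie_breaker), url, 1))

    visited = []
    while queue:
        neg_priority, _, url, page = heapq.heappop(queue)
        visited.append((url, page))
        if page < pages_per_url:
            # The next-page request either inherits the parent's priority
            # or falls back to the default of 0
            child = -neg_priority if propagate_priority else 0
            heapq.heappush(queue, (-child, next(tie_breaker), url, page + 1))
    return visited


print(crawl_order(["A", "B"], 2))
print(crawl_order(["A", "B"], 2, propagate_priority=False))
```

With the priority carried forward, the model visits both pages of A before touching B. Without it, the next-page requests drop to priority 0 and the model visits page 1 of every URL first, which is the behavior described in the question. A real crawl downloads several requests concurrently, so the ordering there is only approximate unless concurrency is reduced.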

Update:

    def parse(self, response):
        # I skip your logic here
        priority = response.meta['priority']
        next_page = response.xpath('//a[contains(., "- Next>>")]/@href').get()
        # If it exists and there is a next page enter if statement
        if next_page is not None:
            # Go to next page
            yield response.follow(next_page, self.parse, priority=priority, meta={'priority': priority})
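The `KeyError: 'priority'` in the question's traceback is raised when a response reaches `parse` without that key in its meta. One defensive option, offered only as a sketch, is to read the value with a fallback instead of indexing the dict directly. `inherited_priority` is a hypothetical helper name, the `SimpleNamespace` objects are stand-ins for Scrapy responses (only their `.meta` dict matters here), and the fallback of 0 is chosen to match Scrapy's default request priority.

```python
from types import SimpleNamespace


def inherited_priority(response, default=0):
    """Return response.meta['priority'], or `default` when the key is missing."""
    return response.meta.get("priority", default)


# Stand-ins for Scrapy responses: only the `.meta` dict is used in this sketch
with_priority = SimpleNamespace(meta={"priority": 3})
without_priority = SimpleNamespace(meta={})

print(inherited_priority(with_priority))     # 3
print(inherited_priority(without_priority))  # 0
```

Inside `parse`, this would replace `priority = response.meta['priority']`. The fallback keeps the spider running, but a request that arrives without the key then crawls at the default priority, so it is still worth finding out which request was created without `meta={'priority': ...}`.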