How to ensure every URL gets parsed in my Scrapy spider

Asked: 2019-02-01 06:27:40

Tags: python beautifulsoup scrapy web-crawler

I'm trying to crawl every page of the recipe listing on a food blog, scrape the recipe URLs from each page, and write them all to a single .txt file. The code runs without errors, but it only ever processes the first of the URLs listed in urls inside start_requests.

I added .log() to check that urls really does contain all the URLs I want to scrape, and when I run Scrapy from the command prompt I get the following confirmation that they are there:

2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=1
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=2
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=3
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=4
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=5
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=6
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=7
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=8
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=9
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=10
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=11
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=12
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=13
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=14
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=15
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=16
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=17
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=18
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=19
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=20

My current code:

import scrapy
from bs4 import BeautifulSoup


class QuotesSpider(scrapy.Spider):
    name = "recipes"

    def start_requests(self):
        urls = []
        for i in range (1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            urls.append(curr_url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        page_links = soup.find_all(class_="post-summary")    
        for link in page_links:
            with open("links.txt", "a") as f:
                f.write(link.a["href"] + "\n")

When I run this, the following output is written to links.txt:

https://pinchofyum.com/5-minute-vegan-yogurt
https://pinchofyum.com/red-curry-noodles
https://pinchofyum.com/15-minute-meal-prep-cauliflower-fried-rice-with-crispy-tofu
https://pinchofyum.com/5-ingredient-vegan-vodka-pasta
https://pinchofyum.com/lentil-greek-salads-with-dill-sauce
https://pinchofyum.com/coconut-oil-granola-remix
https://pinchofyum.com/quinoa-crunch-salad-with-peanut-dressing
https://pinchofyum.com/15-minute-meal-prep-cilantro-lime-chicken-and-lentils
https://pinchofyum.com/instant-pot-sweet-potato-tortilla-soup
https://pinchofyum.com/garlic-butter-baked-penne
https://pinchofyum.com/15-minute-meal-prep-creole-chicken-and-sausage
https://pinchofyum.com/lemon-chicken-soup-with-orzo
https://pinchofyum.com/brussels-sprouts-tacos
https://pinchofyum.com/14-must-bake-holiday-cookie-recipes
https://pinchofyum.com/how-to-cook-chicken

The links themselves are correct, but there should be more than 50 pages' worth of them.

Any suggestions? What am I missing?

1 Answer:

Answer 0 (score: 0)

As I understand it, you want to make sure that every page in urls was successfully scraped and actually contained links. If so, see the code below:

import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher  # works on older Scrapy versions; removed in Scrapy 2.0+


class QuotesSpider(scrapy.Spider):
    name = "recipes"
    urls = []

    def __init__(self):
        # Run spider_closed once the spider has finished crawling
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def start_requests(self):
        for i in range(1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            self.urls.append(curr_url)
            yield scrapy.Request(url=curr_url, callback=self.parse)

    def parse(self, response):
        page_links = response.css(".post-summary")
        if len(page_links) > 0:
            # Remove this URL from self.urls to confirm that it has been parsed
            self.urls.remove(response.url)
            for link in page_links:
                with open("links.txt", "a") as f:
                    f.write(link.css("a::attr(href)").extract_first() + "\n")

    def spider_closed(self, spider):
        self.log("Following URLs were not parsed: %s" % self.urls)

What it does: every URL to be scraped is appended to self.urls, and once a URL has been scraped and found to contain links, it is removed from self.urls.

Note that there is another method, spider_closed, which only executes once the crawler has finished, so it will print whichever URLs were never scraped or contained no links.
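As an aside, scrapy.xlib.pydispatch is removed in recent Scrapy releases; the same signal hookup can be done through the from_crawler classmethod instead. A minimal sketch of that variant (same spider, only the signal wiring changes) might look like this:

import scrapy
from scrapy import signals


class QuotesSpider(scrapy.Spider):
    name = "recipes"
    urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Build the spider as usual, then register the spider_closed handler
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.log("Following URLs were not parsed: %s" % self.urls)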

Also, why use BeautifulSoup at all? Scrapy's own Selector class, available directly on the response, can do the same job.
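For example, a minimal selector-only version of the original spider might look like the sketch below; the .post-summary / anchor structure is taken from the question, and the class name is just a placeholder:

import scrapy


class RecipeLinksSpider(scrapy.Spider):
    name = "recipes"
    # Build the listing-page URLs up front, exactly as in the question
    start_urls = ["https://pinchofyum.com/recipes?fwp_paged=%s" % i for i in range(1, 60)]

    def parse(self, response):
        # "a::attr(href)" extracts the link target of each recipe card's anchor tag
        for href in response.css(".post-summary a::attr(href)").getall():
            with open("links.txt", "a") as f:
                f.write(href + "\n")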