Question

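I'm very new to writing and using classes in Python. I've written a parser using a class to check whether the .get_nextpage() method produces a next-page URL. When .get_nextpage() does produce a link, that link should be printed right after the self.get_nextpage(soup) line inside the try/except block in the .get_links() method. I'm stuck on how to make that happen.

I'm not after any alternative solution. I only want to know the logic, and whether this approach can work.

I've used a while True condition in the .get_links() method so that it keeps running as long as .get_nextpage() produces a new link. (That is not part of this question. I mention it only to explain why "while True" is there.)

This is the scraper: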

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://stackoverflow.com/questions/tagged/web-scraping"

class StackOverflowClass(object):

    def __init__(self, link):
        self.url = link

    def get_links(self):
        while True:
            res = requests.get(self.url)
            soup = BeautifulSoup(res.text,"lxml")

            try:
                self.get_nextpage(soup)
                # what to do here to get the link generated within ".get_nextpage()" method
            except:break

    def get_nextpage(self,sauce):
        nurl = sauce.select_one("div.pager a[rel='next']")
        if nurl:
            link = urljoin(self.url,nurl.get("href"))

crawler = StackOverflowClass(url)
crawler.get_links()

To be clear about what I mean, please take another look at the following lines:

try:
    self.get_nextpage(soup)
    # what to do here to get the link generated within ".get_nextpage()" method
except:break

Answer 1

You can modify get_nextpage as follows:

def get_nextpage(self,sauce):
    nurl = sauce.select_one("div.pager a[rel='next']")
    if nurl:
        link = urljoin(self.url,nurl.get("href"))
        return link

Then you can use it in get_links() to get the link value:

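def get_links(self):
    while True:
        res = requests.get(self.url)
        soup = BeautifulSoup(res.text,"lxml")

        # call get_nextpage() once; it returns the URL, or None on the last page
        link = self.get_nextpage(soup)
        if not link:
            break
        print(link)  # do whatever you want with link
        self.url = link  # request the next page on the following pass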
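If you want to try the idea without hitting the network, here is a minimal offline sketch. It rests on several assumptions: FAKE_PAGES, NextLinkFinder and PagedCrawler are hypothetical names that are not part of the original scraper, the standard library's html.parser stands in for BeautifulSoup, and a dict lookup stands in for requests.get. It also advances self.url to the returned link on each pass so that the loop reaches the last page and stops.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical stand-in for the site (URL -> HTML) so the sketch runs offline.
FAKE_PAGES = {
    "https://example.com/list?page=1":
        '<div class="pager"><a rel="next" href="/list?page=2">next</a></div>',
    "https://example.com/list?page=2":
        '<div class="pager"><a rel="next" href="/list?page=3">next</a></div>',
    "https://example.com/list?page=3": '<div class="pager"></div>',
}


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first <a rel="next"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("rel") == "next" and self.href is None:
            self.href = attrs.get("href")


class PagedCrawler:
    def __init__(self, link):
        self.url = link

    def get_nextpage(self, html):
        finder = NextLinkFinder()
        finder.feed(html)
        if finder.href:
            # Hand the value back to the caller instead of keeping it local.
            return urljoin(self.url, finder.href)
        return None  # no next page

    def get_links(self):
        found = []
        while True:
            html = FAKE_PAGES[self.url]  # stands in for requests.get(self.url).text
            link = self.get_nextpage(html)
            if link is None:
                break
            print(link)
            found.append(link)
            self.url = link  # move on, otherwise the loop never ends
        return found


crawler = PagedCrawler("https://example.com/list?page=1")
links = crawler.get_links()
```

The key point is the return statement in get_nextpage: a name assigned inside a method is local to that call, so the caller only sees what the method returns.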

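Note that using try/except instead of if/else does not work here. A method or function with no explicit return statement returns None, and evaluating None inside a try block never raises an exception, so the except branch never runs and the loop never breaks.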
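A small sketch of that behaviour, in case it helps. The function names no_return, with_return and call_in_try are made up for illustration and the URL is a placeholder:

```python
def no_return(found):
    if found:
        link = "https://example.com/next"  # assigned locally, never returned


def with_return(found):
    if found:
        return "https://example.com/next"
    # falling off the end of a function returns None implicitly


def call_in_try(func, found):
    try:
        result = func(found)
    except Exception:
        return "exception"  # never reached: returning None is not an error
    return result


a = call_in_try(no_return, True)     # None, the link was never returned
b = call_in_try(no_return, False)    # None
c = call_in_try(with_return, True)   # the URL string
d = call_in_try(with_return, False)  # None, and still no exception
```

None of the four calls reaches the except branch, which is why the original try/except version can never hit its break.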

Can't make use of the link generated by a method in a scraper
