Question

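I wrote a script to scrape the quotes and author names from Quotes to Scrape. In this project I use requests to fetch each page and bs4 to parse the HTML. I use a while loop to follow the pagination links to the next page, but I want the code to stop when there are no more pages. My code works, but it never stops running.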

Here is my code:

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    url = 'http://quotes.toscrape.com'
    r = requests.get(url)
    soup = bs(r.text,'html.parser')
    quotes = soup.find_all('span',attrs={"class":"text"})
    authors = soup.find_all('small',attrs={"class":"author"})
    p_link = soup.find('a',text="Next")

    condition = True
    while condition:
        with open('quotes.txt','a') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')
        if p_link not in soup:
            condition = False
            page += 1
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
            r = requests.get(url)
            soup = bs(r.text,'html.parser')
            quotes = soup.find_all('span',attrs={"class":"text"})
            authors = soup.find_all('small',attrs={"class":"author"})
            condition = True
        else:
            condition = False

    print('done')


scrape()

Answer 1

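The loop never stops because p_link is never in soup, so the "p_link not in soup" test is always true. I found two reasons for this.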

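You search for the link with the text "Next", but the actual link text appears to be "Next" + a space + a right arrow.
The tag contains an "href" attribute that points to the next page, and this attribute has a different value on every page.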

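Setting condition to False at the top of the first if block inside the while loop makes no difference either, because you set it back to True at the end of that same block.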
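
The diagnosis above can be reproduced without touching the network. The sketch below is only an illustration: PAGER_HTML is a simplified fragment assumed to resemble the site's pager markup, and string= is the current bs4 name for the older text= argument used in the question.

```python
from bs4 import BeautifulSoup

# Simplified pager markup, assumed to resemble what the site serves.
PAGER_HTML = """
<ul class="pager">
  <li class="next"><a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a></li>
</ul>
"""

soup = BeautifulSoup(PAGER_HTML, "html.parser")

# The <a> tag has two children (the text "Next " and a <span>), so its
# .string is None and an exact match on "Next" finds nothing.
p_link = soup.find("a", string="Next")

# `x in soup` only checks the direct children of `soup`, so this test is
# True here even though the page does have a next link.
never_stops = p_link not in soup

# Looking the link up through its parent's class does not depend on the text.
next_item = soup.find("li", attrs={"class": "next"})

print(p_link)
print(never_stops)
print(next_item.a["href"])
```

Since the membership test only inspects the top-level children of soup, it would stay true even if the link lookup itself succeeded, which is why the answer replaces it with an "is None" check.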

So...

Instead of searching for the text Next, use:

soup.find('li',attrs={"class":"next"})

For the condition, use:

if soup.find('li',attrs={"class":"next"}) is None:
    condition = False
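
As a quick sanity check of that condition, here is a minimal sketch that wraps it in a helper and runs it on two pager fragments. The helper name has_next_page and both HTML fragments are hypothetical and simplified, not copied from the site.

```python
from bs4 import BeautifulSoup


def has_next_page(html):
    """Return True if the markup contains an <li class="next"> item."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("li", attrs={"class": "next"}) is not None


# Hypothetical pager fragments for a middle page and for the last page.
MIDDLE_PAGE = (
    '<ul class="pager">'
    '<li class="previous"><a href="/page/1/">Previous</a></li>'
    '<li class="next"><a href="/page/3/">Next</a></li>'
    '</ul>'
)
LAST_PAGE = (
    '<ul class="pager">'
    '<li class="previous"><a href="/page/9/">Previous</a></li>'
    '</ul>'
)

print(has_next_page(MIDDLE_PAGE))
print(has_next_page(LAST_PAGE))
```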

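Finally, if you also want to write the quotes from the last page, I suggest moving the "write to file" part to the end. Or sidestep the problem altogether, like this: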

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    while True:

        if page == 1:
            url = 'http://quotes.toscrape.com'
        else:
            url = 'http://quotes.toscrape.com/page/{}'.format(page)

        r = requests.get(url)
        soup = bs(r.text,'html.parser')

        quotes = soup.find_all('span',attrs={"class":"text"})
        authors = soup.find_all('small',attrs={"class":"author"})

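        # Open the file as UTF-8 and write the tag text directly; in Python 3,
        # str() on encoded bytes would write a "b'...'" literal instead.
        with open('quotes.txt','a',encoding='utf-8') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')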

        if soup.find('li',attrs={"class":"next"}) is None:
            break

        page+=1

    print('done')


scrape()
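
Since the next link carries an "href" attribute, another option is to follow that link instead of counting pages. The following is a rough sketch of that idea rather than a drop-in replacement: next_page_url, crawl and the fetch parameter are hypothetical names introduced here, and fetch can be any callable that maps a URL to HTML text.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    next_item = soup.find("li", attrs={"class": "next"})
    if next_item is None or next_item.a is None:
        return None
    # The href is usually relative (e.g. "/page/2/"), so resolve it.
    return urljoin(current_url, next_item.a["href"])


def crawl(start_url, fetch):
    """Yield the HTML of each page, following "next" links until none is left.

    `fetch` is an assumed callable taking a URL and returning HTML text.
    """
    url = start_url
    while url is not None:
        html = fetch(url)
        yield html
        url = next_page_url(html, url)
```

With requests, it could be called as crawl('http://quotes.toscrape.com', lambda u: requests.get(u).text), parsing and writing the quotes for each page the generator yields.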
