Question

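I am trying to scrape a word list from a website using BeautifulSoup. Scraping the first page is easy, but to get all the pages I have to obtain each page's identifier (a string, to be precise). That is the hard part for me, because the pages are not numbered in a conventional {1-100} or {a-z} sequence; each one is different.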

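For example, this is the page that holds the links to all the other pages in the /a/ category. Normally they would look like a/1, a/2, a/3, but in this case they are: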

https://dictionary.cambridge.org/browse/english/a/a
https://dictionary.cambridge.org/browse/english/a/a-conflict-of-interest
https://dictionary.cambridge.org/browse/english/a/a-hard-tough-row-to-hoe
and so on...all the way to /english/z/{}

My code:

import requests
from bs4 import BeautifulSoup as bs

url = 'https://dictionary.cambridge.org/browse/english/a/a/'
head = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
# regex = 'idiom$'

with open('output.txt', 'w', encoding="utf-8") as f_out:
    soup = bs(requests.get(url, headers={'User-Agent': head}).content, 'html.parser')
    # attrs is a dict mapping the attribute name to its value
    div = soup.find('div', attrs={'class': 'hdf ff-50 lmt-15'})
    links = div.find_all('a')

    for link in links:
        text_str = link.text.strip()
        print(text_str)
        print(text_str, file=f_out)

It gets the text as expected, but after that I don't know how to parse the following pages.

Answer 1

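You can loop over the letters, collect all the href attributes, cut off the last part of each one (which is your word or expression), and save it to a file.

Here's how: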

import string

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
}
letters = string.ascii_lowercase
main_url = "https://dictionary.cambridge.org/browse/english/"

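for letter in letters:
    print(f"Fetching words for letter {letter.upper()}...")
    page = requests.get(f"{main_url}{letter}", headers=headers).content
    # Each matching link points to one sub-page of the current letter.
    links = BeautifulSoup(page, "html.parser").find_all("a", {"class": "dil tcbd"})
    with open(f"{letter}_words.txt", "w", encoding="utf-8") as output:
        # The hrefs end with "/", so the word is the second-to-last segment;
        # links[1:] skips the first matching link.
        output.write("\n".join(a["href"].split("/")[-2] for a in links[1:]) + "\n")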
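
A side note on the `split("/")[-2]` step: it relies on every href ending with a slash. If that cannot be guaranteed, a small helper along these lines may be more forgiving. This is only a sketch using the standard library; the name `last_segment` is hypothetical and not part of the answer's script.

```python
from urllib.parse import urlparse


def last_segment(href):
    """Return the final path segment of a URL, with or without a trailing slash."""
    # urlparse drops the query string and fragment from the path component.
    path = urlparse(href).path
    # Strip trailing slashes, then keep whatever follows the last remaining one.
    return path.rstrip("/").rsplit("/", 1)[-1]
```

It could replace `a["href"].split("/")[-2]` with `last_segment(a["href"])` in the loop over letters.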

Output: the loop over letters writes one file per letter; for example, the file for the letter a begins:

a-conflict-of-interest
a-hard-tough-row-to-hoe
a-meeting-of-minds
a-pretty-fine-kettle-of-fish
a-thing-of-the-past
ab-initio
abduction
abo
abreast
absolute-motion
absurdity
accent
accidental-death-benefit
account-for-sth
acct
acetylcholinesterase
ackee
acrobatics
actionable
actuarial
adapting
adduce
adjective
administration-order
adoration
adumbrated
advertised
aerie
affect
affronting
afters
agender
agit-pop
...
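
The answer's script saves the names of the sub-pages, while the question also asks how to visit each of those sub-pages and collect the words listed on them. A minimal sketch of that second step follows. It assumes the class names quoted in the question and the answer (`hdf ff-50 lmt-15` and `dil tcbd`) still match the site's markup. The helper names `fetch_html`, `range_links`, `words_on_page`, and `crawl_letter` are hypothetical, and the short User-Agent string is a placeholder. Unlike the answer's script, it does not skip the first matching link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://dictionary.cambridge.org/browse/english/"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder; use a full browser string if needed


def fetch_html(url):
    """Download one page and return its HTML as text."""
    import requests  # imported here so the parsing helpers work without it

    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def range_links(letter_html):
    """Return the hrefs of the sub-pages listed on a letter page."""
    soup = BeautifulSoup(letter_html, "html.parser")
    anchors = soup.find_all("a", {"class": "dil tcbd"})
    return [a["href"] for a in anchors if a.get("href")]


def words_on_page(range_html):
    """Return the link texts inside the word-list container of a sub-page."""
    soup = BeautifulSoup(range_html, "html.parser")
    container = soup.find("div", {"class": "hdf ff-50 lmt-15"})
    if container is None:  # layout changed or the page is empty
        return []
    return [a.get_text(strip=True) for a in container.find_all("a")]


def crawl_letter(letter, fetch=fetch_html):
    """Collect the words from every sub-page of one letter."""
    words = []
    for href in range_links(fetch(BASE_URL + letter)):
        # urljoin handles both absolute and site-relative hrefs.
        words.extend(words_on_page(fetch(urljoin(BASE_URL, href))))
    return words
```

Calling `crawl_letter("a")` performs live requests, so a short pause between requests may be advisable. Passing a custom `fetch` callable allows the parsing to be checked against saved HTML instead.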
