Python Web Scraping Multiple Pages

Date: 2020-10-21 19:22:29

Tags: python web-scraping beautifulsoup python-requests

I am scraping all of the words on the Merriam-Webster website.

I want to scrape every browse page from a to z, along with all of the pages under each letter, and save the words to a text file. The problem I'm running into is that I only get the first result from the table, not all of them. I know this is a huge amount of text (roughly 500,000 entries), but I'm doing this to educate myself.

Code:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.merriam-webster.com/browse/dictionary/a/'

page = 1
# for page in range(1, 75):

req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
containers = soup.find('div', attrs={'class': 'entries'})
table = containers.find_all('ul')

for entries in table:
    links = entries.find_all('a')
    name = links[0].text
    print(name)

Right now I want to get every entry out of this table, but I'm only getting the first one.

I'm a bit stuck here, so any help would be appreciated. Thanks.

https://www.merriam-webster.com/browse/medical/a-z
https://www.merriam-webster.com/browse/legal/a-z
https://www.merriam-webster.com/browse/dictionary/a-z
https://www.merriam-webster.com/browse/thesaurus/a-z

2 Answers:

Answer 0 (score: 1):

I think you need another loop:

for entries in table:
    links = entries.find_all('a')
    for name in links:
        print(name.text)
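For reference, here is a sketch of how that nested loop slots into the question's own script, with the commented-out page loop re-enabled (the page range comes straight from the question's commented-out line and is only an assumption about how many browse pages the letter "a" has):

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.merriam-webster.com/browse/dictionary/a/'

for page in range(1, 75):                       # pages 1..74, per the question's commented-out range
    req = requests.get(URL + str(page))
    soup = bs(req.text, 'html.parser')
    containers = soup.find('div', attrs={'class': 'entries'})
    if containers is None:                      # page without an entries block: skip it
        continue
    for entries in containers.find_all('ul'):
        for link in entries.find_all('a'):      # every link in the list, not just links[0]
            print(link.text)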

Answer 1 (score: 1):

To get all of the entries, you can use this example:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/a/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for a in soup.select('.entries a'):
    print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

Prints:

(a) heaven on earth            https://www.merriam-webster.com/dictionary/%28a%29%20heaven%20on%20earth
(a) method in/to one's madness https://www.merriam-webster.com/dictionary/%28a%29%20method%20in%2Fto%20one%27s%20madness
(a) penny for your thoughts    https://www.merriam-webster.com/dictionary/%28a%29%20penny%20for%20your%20thoughts
(a) quarter after              https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20after
(a) quarter of                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20of
(a) quarter past               https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20past
(a) quarter to                 https://www.merriam-webster.com/dictionary/%28a%29%20quarter%20to
(all) by one's lonesome        https://www.merriam-webster.com/dictionary/%28all%29%20by%20one%27s%20lonesome
(all) choked up                https://www.merriam-webster.com/dictionary/%28all%29%20choked%20up
(all) for the best             https://www.merriam-webster.com/dictionary/%28all%29%20for%20the%20best
(all) in good time             https://www.merriam-webster.com/dictionary/%28all%29%20in%20good%20time

...and so on.

To scrape multiple pages:

url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'

for page in range(1, 76):
    soup = BeautifulSoup(requests.get(url.format(page)).content, 'html.parser')
    for a in soup.select('.entries a'):
        print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))
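Each of those pages is a separate HTTP request, so reusing one connection with requests.Session speeds things up a little. A minimal sketch of the same loop with a shared session (the Session object and the timeout are additions, not part of the original answer):

import requests
from bs4 import BeautifulSoup

url = 'https://www.merriam-webster.com/browse/dictionary/a/{}'

with requests.Session() as session:             # reuse one connection for all 75 requests
    for page in range(1, 76):
        soup = BeautifulSoup(session.get(url.format(page), timeout=10).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))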

EDIT: To scrape all pages from A to Z:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'

for char in range(ord('a'), ord('z')+1):
    page = 1
    while True:
        soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
        for a in soup.select('.entries a'):
            print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

        # stop when the "Last" pagination link's data-page attribute is empty (no further pages)
        last_page = soup.select_one('[aria-label="Last"]')['data-page']
        if last_page == '':
            break

        page += 1
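One fragile spot in the loop above: if a page has no element matching [aria-label="Last"], select_one returns None and the ['data-page'] lookup raises a TypeError. A small helper that makes the check defensive (a sketch; the selector and the empty-string convention come from the answer's code, the None handling is an addition):

def has_more_pages(soup):
    # True if the browse page's "Last" pagination link points to a further page
    last_link = soup.select_one('[aria-label="Last"]')
    return last_link is not None and last_link.get('data-page', '') != ''

Inside the while loop, the pagination check then becomes: if not has_more_pages(soup): break.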

EDIT 2: To save to a file:

import requests
from bs4 import BeautifulSoup


url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'


with open('data.txt', 'w') as f_out:
    for char in range(ord('a'), ord('z')+1):
        page = 1
        while True:
            soup = BeautifulSoup(requests.get(url.format(chr(char), page)).content, 'html.parser')
            for a in soup.select('.entries a'):
                # echo progress to the console
                print('{:<30} {}'.format(a.text, 'https://www.merriam-webster.com' + a['href']))

                # write a tab-separated word/URL line to data.txt
                print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)

            last_page = soup.select_one('[aria-label="Last"]')['data-page']
            if last_page == '':
                break

            page += 1
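Putting the pieces together, here is a sketch of the same A-to-Z download with a shared session, a short pause between requests, and the defensive pagination check from above (the pause length, the timeout, and the None handling are additions/assumptions, not part of the original answer):

import time

import requests
from bs4 import BeautifulSoup

url = 'https://www.merriam-webster.com/browse/dictionary/{}/{}'

with requests.Session() as session, open('data.txt', 'w') as f_out:
    for char in range(ord('a'), ord('z') + 1):
        page = 1
        while True:
            soup = BeautifulSoup(session.get(url.format(chr(char), page), timeout=10).content, 'html.parser')

            for a in soup.select('.entries a'):
                # one tab-separated line per entry: word, then its dictionary URL
                print('{}\t{}'.format(a.text, 'https://www.merriam-webster.com' + a['href']), file=f_out)

            last_link = soup.select_one('[aria-label="Last"]')
            if last_link is None or last_link.get('data-page', '') == '':
                break                           # no further pages for this letter

            page += 1
            time.sleep(0.5)                     # small pause between requests; adjust as needed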