Scraping URLs from multiple web pages

Date: 2020-05-28 11:15:30

Tags: html python-3.x web-scraping beautifulsoup

I am trying to extract URLs from multiple web pages (2 in this example), but for some reason my output is a duplicated list of the URLs extracted from the first page. What am I doing wrong?

My code:

import requests
from bs4 import BeautifulSoup

# URLs of books in scope
urls = []
for pn in range(2):
    baseUrl = 'https://www.goodreads.com'
    path = '/shelf/show/bestsellers?page=' + str(pn + 1)
    page = requests.get(baseUrl + path).text
    print(baseUrl + path)
    soup = BeautifulSoup(page, "html.parser")
    # Cover links on the shelf carry the "leftAlignedImage" class;
    # skip author links and keep only book links.
    for link in soup.find_all('a', attrs={'class': "leftAlignedImage"}):
        if not link['href'].startswith('/author/show/'):
            urls.append(baseUrl + link['href'])
for u in urls:
    print(u)

Output:

https://www.goodreads.com/shelf/show/bestsellers?page=1
https://www.goodreads.com/shelf/show/bestsellers?page=2
https://www.goodreads.com/book/show/5060378-the-girl-who-played-with-fire
https://www.goodreads.com/book/show/968.The_Da_Vinci_Code
https://www.goodreads.com/book/show/4667024-the-help
https://www.goodreads.com/book/show/2429135.The_Girl_with_the_Dragon_Tattoo
https://www.goodreads.com/book/show/3.Harry_Potter_and_the_Sorcerer_s_Stone
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing
.
.
.
https://www.goodreads.com/book/show/4588.Extremely_Loud_Incredibly_Close
https://www.goodreads.com/book/show/36809135-where-the-crawdads-sing

1 Answer:

Answer 0 (score: 1)

You are getting duplicate URLs because the same page is loaded both times. Even when you set page=2, the site only serves the first page of the bestsellers shelf unless you are logged in.
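
You can confirm this with a quick check built on the question's own selector (a minimal sketch, assuming the same `leftAlignedImage` class used in the code above):

import requests
from bs4 import BeautifulSoup

base = 'https://www.goodreads.com/shelf/show/bestsellers?page='
pages = []
for pn in (1, 2):
    soup = BeautifulSoup(requests.get(base + str(pn)).text, "html.parser")
    # Collect the set of book hrefs found on each page
    pages.append({a['href'] for a in soup.find_all('a', attrs={'class': 'leftAlignedImage'})})
print(pages[0] == pages[1])  # True => the site served the same page twice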

To fix this, you will have to modify your code to log in before loading the pages, or pass cookies exported from a logged-in browser session, as sketched below.
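
A minimal sketch of the cookie approach with `requests`: the cookie name `_session_id2` below is an assumption and may differ; copy the actual session cookie name and value from your browser's developer tools while logged in to Goodreads:

import requests
from bs4 import BeautifulSoup

# Hypothetical cookie: replace the name/value with whatever your
# logged-in browser actually sends to goodreads.com.
cookies = {
    '_session_id2': '<value copied from your logged-in browser>',
}

baseUrl = 'https://www.goodreads.com'
urls = []
for pn in range(2):
    path = '/shelf/show/bestsellers?page=' + str(pn + 1)
    # Sending the session cookie makes the site treat the request
    # as logged in, so page=2 returns the second page.
    page = requests.get(baseUrl + path, cookies=cookies).text
    soup = BeautifulSoup(page, "html.parser")
    for link in soup.find_all('a', attrs={'class': 'leftAlignedImage'}):
        if not link['href'].startswith('/author/show/'):
            urls.append(baseUrl + link['href'])

With a valid session cookie the two pages return different book lists, and the duplicates disappear.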