Question

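I have scraped a set of links from a website (https://www.gmcameetings.co.uk): every link that contains the word "meetings", i.e. the meeting papers. They are now stored in 'meeting_links'. I now need to follow each of those links and scrape more links from the pages they point to.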

I've gone back to using the requests library and tried

r2 = requests.get("meeting_links")

but it returns the following error:

MissingSchema: Invalid URL 'list_meeting_links': No schema supplied. 
Perhaps you meant http://list_meeting_links?

I tried changing it, but that made no difference.

Here is my code so far, showing how I got the links from the first URL I needed.

# importing libraries and defining
import requests
import urllib.request
import time 
from bs4 import BeautifulSoup as bs

# set url
url = "https://www.gmcameetings.co.uk/" 

# grab html 
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')

# creating folder to store pdfs - if not create separate folder
folder_location = r'E:\Internship\WORK'

# getting all meeting href off url
meeting_links = soup.find_all('a',href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/')>1:
        print("Meeting!") 

#second set of links
r2 = requests.get("meeting_links")

Do I need to do something to 'meeting_links' before I can go back to using the requests library? I'm completely lost.

Answer 1

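As far as I can tell, the problem is your new request, requests.get("meeting_links"). It passes the literal string "meeting_links" to the requests method instead of a URL taken from the list, which is why requests complains that no schema was supplied. The request should instead be made once per link, like this: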

for link in meeting_links:
    if link['href'].find('/meetings/')>1:
        r2 = requests.get(link['href'])
        # do something with the response r2 here
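
The loop above can be fleshed out a little. What follows is only a minimal sketch, not a definitive implementation, and it rests on a few assumptions that are not in the question: that some hrefs may be relative (so they are resolved with urljoin), that duplicate links should be skipped, and that a one-second pause between requests is acceptable. The function names extract_meeting_urls and fetch_pages are hypothetical helpers made up for illustration.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_meeting_urls(html, base_url):
    """Return absolute URLs of every <a href=...> that contains '/meetings/'."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # href=True (the boolean, not a string) matches any <a> tag that has an href
    for a in soup.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])  # resolves relative hrefs
        if "/meetings/" in absolute and absolute not in urls:
            urls.append(absolute)
    return urls


def fetch_pages(urls, delay=1.0):
    """Request each URL in turn and return a dict mapping url -> HTML text."""
    pages = {}
    with requests.Session() as session:
        for u in urls:
            response = session.get(u, timeout=10)
            response.raise_for_status()  # stop on HTTP errors
            pages[u] = response.text
            time.sleep(delay)  # assumed politeness delay, adjust as needed
    return pages
```

With these in place, the first page's links would be collected with extract_meeting_urls(requests.get(url).text, url), and the result passed to fetch_pages. Each returned page can then be parsed with extract_meeting_urls (or a similar filter) again to find the second set of links.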
