I have scraped a set of links from a website (https://www.gmcameetings.co.uk). All the links containing the word "meetings" (i.e. the meeting documents) are now stored in meeting_links. I now need to follow each of those links to scrape more links from within them.
I have gone back to using the requests library and tried
r2 = requests.get("meeting_links")
but it returns the following error:
MissingSchema: Invalid URL 'list_meeting_links': No schema supplied.
Perhaps you meant http://list_meeting_links?
I changed it to that, but it still made no difference.
Here is my code so far, showing how I get the links from the first URL that I need.
# importing libraries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pdfs - if it doesn't exist, create a separate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting href off url
meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
#second set of links
r2 = requests.get("meeting_links")
Do I need to do something to meeting_links before I can start using the requests library again? I'm completely lost.
Answer 0 (score: 2)
From what I can tell, your problem probably lies in this line:
r2 = requests.get("meeting_links")
because it looks like you are passing a literal string to the requests method instead of an actual URL. The request should look like this:
for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        r2 = requests.get(link['href'])
        # <do something with the request>
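To expand on the placeholder above, here is a minimal sketch of what the second round of scraping could look like, reusing the url, meeting_links, requests and bs objects already defined in the question. It assumes each meeting page is plain HTML whose anchor tags point to the documents you want; the names meeting_url and sub_links are just illustrative, and urljoin is used only as a precaution in case the hrefs are relative.

from urllib.parse import urljoin  # handles relative hrefs, just in case

sub_links = []  # will collect the links found on each meeting page

for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        meeting_url = urljoin(url, link['href'])   # make sure the URL is absolute
        r2 = requests.get(meeting_url)             # fetch the meeting page itself
        meeting_soup = bs(r2.text, 'lxml')         # parse it the same way as the front page
        # gather every href on the meeting page; filter further as needed
        for sub_link in meeting_soup.find_all('a', href=True):
            sub_links.append(sub_link['href'])

print(len(sub_links), "links collected from the meeting pages")

From there you could, for example, filter sub_links for URLs ending in ".pdf" and save those files into folder_location, if downloading the meeting documents is the end goal.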