I am trying to scrape the following page in order to download all of the sermons on the site: https://erchackney.churchinsight.com/Media/AllMedia.aspx
I am not sure how to move between pages, because the page number never appears in the URL.
I inspected the request the browser sends when changing pages, and this is the form data I saw (these are POST form fields, not request headers):
ctl00$ctl00$cphBody$cphSharedContents$hidden_sort:
ctl00$ctl00$cphBody$cphSharedContents$hidden_page: 1
ctl00$ctl00$cphBody$cphContents$txt_Search:
ctl00$ctl00$cphBody$cphContents$ddl_Speaker: All
ctl00$ctl00$cphBody$cphContents$ddl_BibleBook: 0
ctl00$ctl00$cphBody$cphContents$txt_BibleChapter:
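From what I can tell, these ctl00$... entries are form fields, so I assume they have to be sent in the body of a POST request. My (untested) understanding is that they would be encoded roughly like this, mirroring the values captured above:

from urllib.parse import urlencode

# Form fields observed in the browser; hidden_page appears to control pagination.
form = {
    'ctl00$ctl00$cphBody$cphSharedContents$hidden_sort': '',
    'ctl00$ctl00$cphBody$cphSharedContents$hidden_page': '5',
    'ctl00$ctl00$cphBody$cphContents$txt_Search': '',
    'ctl00$ctl00$cphBody$cphContents$ddl_Speaker': 'All',
    'ctl00$ctl00$cphBody$cphContents$ddl_BibleBook': '0',
    'ctl00$ctl00$cphBody$cphContents$txt_BibleChapter': '',
}
data = urlencode(form).encode('utf-8')  # passing data=... to Request() makes it a POST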
How do I integrate this into the script so that I can choose which page I end up on?
Here is my code so far; no matter what I do, I always end up on page 1:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl, os
url = 'https://erchackney.churchinsight.com/Media/AllMedia.aspx'
context = ssl._create_unverified_context()
print('Attempting to parse site:', url)
q = Request(url)
q.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) '
             'Chrome/23.0.1271.64 Safari/537.11')
q.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
q.add_header('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3')
q.add_header('Accept-Encoding', 'none')
q.add_header('Accept-Language', 'en-US,en;q=0.8')
q.add_header('Connection', 'keep-alive')
# My attempt at requesting page 5: this sends hidden_page as an HTTP header,
# which the server seems to ignore (it expects a form field, not a header).
q.add_header('ctl00$ctl00$cphBody$cphSharedContents$hidden_page', '5')
html = urlopen(q, context=context).read()
filename_list = []
if os.path.exists("sermons.txt"):
    os.remove("sermons.txt")
if html:
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.findAll("table", {"class": "bottomseparator"})
    for table in tables:
        td_list = table.findAll("td")
        name = td_list[0].text.strip().replace(" ", "_")
        speaker = td_list[1].text.strip().replace(" ", "_")
        category = td_list[4].text.strip().replace(" ", "_")
        # The date cell is dd/mm/yyyy; reorder it to yyyymmdd for sortable filenames.
        temp_date = td_list[8].text.split("/")
        date = temp_date[2] + temp_date[1] + temp_date[0]
        # Include the Bible reference only when the cell exists and is non-empty
        # (a bare `if td_list[12]:` is always true for any tag that exists).
        if len(td_list) > 12 and td_list[12].text.strip():
            reference = td_list[12].text.strip().replace(" ", "_")
            filename = date + "~" + category + "~" + name + "~" + speaker + "~" + reference + ".mp3"
        else:
            filename = date + "~" + category + "~" + name + "~" + speaker + "~" + ".mp3"
        print("Adding " + filename)
        filename_list.append(filename)
else:
    print("Could not parse site")