Scraping an ASP.NET (.aspx) page in Python (switching between pages)

Asked: 2018-12-11 19:07:13

Tags: python

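I am trying to scrape the following page: https://erchackney.churchinsight.com/Media/AllMedia.aspx in order to download all the sermons on the site.

The page number is not part of the URL, so I am not sure how to switch between pages.

I checked the request headers sent to the page, and this is what I saw: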

ctl00$ctl00$cphBody$cphSharedContents$hidden_sort: 
ctl00$ctl00$cphBody$cphSharedContents$hidden_page: 1
ctl00$ctl00$cphBody$cphContents$txt_Search: 
ctl00$ctl00$cphBody$cphContents$ddl_Speaker: All
ctl00$ctl00$cphBody$cphContents$ddl_BibleBook: 0
ctl00$ctl00$cphBody$cphContents$txt_BibleChapter: 
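
Names of this shape (ctl00$...) look like ASP.NET WebForms form fields, which a browser normally sends in the body of a POST request rather than as HTTP headers. The sketch below assumes that is the case here and shows how the captured fields could be URL-encoded into such a body. The helper name build_form_body is hypothetical, and whether the server picks the page from hidden_page alone is an assumption.

```python
from urllib.parse import urlencode

# Common prefix of the field names captured from the request above.
PREFIX = 'ctl00$ctl00$cphBody$'


def build_form_body(page):
    """Encode the captured fields as a form body (hypothetical helper).

    Values are the ones seen in the captured request, except hidden_page,
    which is set to the requested page number (assumed to select the page).
    """
    fields = {
        PREFIX + 'cphSharedContents$hidden_sort': '',
        PREFIX + 'cphSharedContents$hidden_page': str(page),
        PREFIX + 'cphContents$txt_Search': '',
        PREFIX + 'cphContents$ddl_Speaker': 'All',
        PREFIX + 'cphContents$ddl_BibleBook': '0',
        PREFIX + 'cphContents$txt_BibleChapter': '',
    }
    # urlencode percent-escapes the '$' characters; urlopen expects bytes.
    return urlencode(fields).encode('ascii')


body = build_form_body(5)
print(body)
```

Passing these bytes as the data argument of Request turns the request into a POST.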

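How can I integrate this into my script so that I can choose which page I end up on?

Here is my code so far. No matter what I do, I always end up on page 1: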

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import ssl, os

url = 'https://erchackney.churchinsight.com/Media/AllMedia.aspx'
context = ssl._create_unverified_context()
print('Attempting to parse site:', url)
q = Request(url)
q.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) '
             'Chrome/23.0.1271.64 Safari/537.11')
q.add_header(
    'Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
q.add_header('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3')
q.add_header('Accept-Encoding', 'none')
q.add_header('Accept-Language', 'en-US,en;q=0.8')
q.add_header('Connection', 'keep-alive')
q.add_header('ctl00$ctl00$cphBody$cphSharedContents$hidden_page', 5)
html = urlopen(q, context=context).read()

filename_list = list()

if os.path.exists("sermons.txt"):
    os.remove("sermons.txt")

if html:
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.findAll("table", {"class": "bottomseparator"})
    for table in tables:
        td_list = table.findAll("td")
        name = td_list[0].text.strip().replace(" ", "_")
        speaker = td_list[1].text.strip().replace(" ", "_").strip()
        category = td_list[4].text.strip().replace(" ", "_").strip()
        temp_date = td_list[8].text.split("/")
        date = temp_date[2] + temp_date[1] + temp_date[0]

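        # A bs4 Tag is always truthy and td_list[12] raises IndexError when
        # the cell is missing, so check the length and the text instead.
        reference = td_list[12].text.strip().replace(" ", "_") if len(td_list) > 12 else ""
        if reference:
            filename = date + "~" + category + "~" + name + "~" + speaker + "~" + reference + ".mp3"
        else:
            filename = date + "~" + category + "~" + name + "~" + speaker + "~" + ".mp3"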

        print("Adding " + filename)
        filename_list.append(filename)
else:
    print("Could not parse site")
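
One possible direction, sketched under assumptions rather than tested against this site: WebForms pages usually expect a POST that echoes back the hidden inputs from the previous response (for example __VIEWSTATE), with the page field changed to the wanted page. The sketch collects the input fields from the first page's HTML and builds such a request using only the standard library. The names InputCollector and build_page_request are hypothetical, and it is an assumption that this server accepts hidden_page as the page selector without further fields such as __EVENTTARGET.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

# Field name taken from the captured request; assumed to select the page.
PAGE_FIELD = 'ctl00$ctl00$cphBody$cphSharedContents$hidden_page'


class InputCollector(HTMLParser):
    """Collect name/value pairs of hidden and text <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        name = attrs.get('name')
        kind = (attrs.get('type') or 'text').lower()
        # Skip submit buttons, checkboxes and the like.
        if name and kind in ('hidden', 'text'):
            self.fields[name] = attrs.get('value') or ''


def build_page_request(url, first_page_html, page):
    """Build a POST request for `page` (hypothetical helper).

    first_page_html is the decoded HTML (str) of a normal GET of the page.
    <select> values such as ddl_Speaker are not collected here and would
    have to be added by hand if the server requires them.
    """
    collector = InputCollector()
    collector.feed(first_page_html)
    fields = dict(collector.fields)
    fields[PAGE_FIELD] = str(page)
    data = urlencode(fields).encode('ascii')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return Request(url, data=data, headers=headers)
```

In the script above, this could be used by decoding the first response (html.decode('utf-8', errors='replace')), passing it to build_page_request(url, decoded_html, 5), and opening the returned request with urlopen(..., context=context) before parsing the tables as before.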

0 answers:

There are no answers yet.