How do I find all the hidden tab hrefs during web scraping?

Time: 2020-10-30 18:46:54

Tags: python json python-3.x web-scraping

On the right-hand side of this website there are several tabs containing documents to view.

[Image: Tabs found on the right-hand side of the webpage]

The underlying markup is an anchor tag with a partial href linking to the document's location. I have been trying to retrieve all of these documents (their URLs typically start with '/documents/'), but without success.


When I scrape, I only seem to capture the first set of documents, found in one of the tabs of the "Hearing Documents" table. Here is the code I have been using to try to get all the hrefs on this page:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2')
soup = BeautifulSoup(page.content, 'html.parser')

# Print the href of every anchor found in the initial (static) HTML
for link in soup.find_all("a"):
    if link.has_attr('href'):
        print(link['href'])

The output contains only the documents from the first tab (in this case); here is a snippet of it:

#collapse1
#collapse2
/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
#collapse4
#collapse6

Does anyone know how to get links like the following (listed below), which exist on the same page? (I should add that I confirmed this with the browser's "Inspect Element" feature; they do not show up in the initial page source. You have to click over to that tab of the "Hearing Documents" table and then inspect the element.)

/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx

/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx
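
A quick local illustration of why these links are invisible to the parser (using a made-up HTML snippet, not the real page): the inactive tabs' panels are filled in by AJAX postbacks after the page loads, so their anchors are simply absent from the HTML that requests.get receives.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the initial page: only the first tab's
# panel is rendered server-side; the other panels arrive via postbacks.
html = """
<div id="collapse1">
  <a href="/documents/judges_forms/a.pdf">A</a>
</div>
<div id="collapse2"></div>  <!-- filled in later by an AJAX postback -->
"""

soup = BeautifulSoup(html, 'html.parser')
links = [a['href'] for a in soup.select('a[href^="/documents/"]')]
print(links)  # only the statically rendered link is found
```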

Thanks for your help!

1 Answer:

Answer 0 (score: 1)

You can use this example to get the links to the documents from the other tabs:


import requests
from bs4 import BeautifulSoup


url = 'https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2'
# Headers that make ASP.NET treat the request as a partial (AJAX) postback:
headers = {'X-MicrosoftAjax': 'Delta=true',
           'X-Requested-With': 'XMLHttpRequest'}

with requests.Session() as s:

    # Load the page once to collect the hidden form fields
    # (__VIEWSTATE, __EVENTVALIDATION, etc.) needed for the postbacks:
    soup = BeautifulSoup(s.get(url).content, 'html.parser')

    data = {}
    for i in soup.select('input[name]'):
        data[i['name']] = i.get('value', '')

    for page in range(0, 6):
        print('Tab no.{}..'.format(page))
        # Simulate clicking tab number `page` of the tab strip:
        data['ScriptManager'] = "ScriptManager|dnn$ctr1843$View$rtSectionHearingTypes"
        data['__EVENTARGUMENT'] = '{"type":0,"index":"' + str(page) + '"}'
        data['__EVENTTARGET'] = "dnn$ctr1843$View$rtSectionHearingTypes"
        data['dnn_ctr1843_View_rtSectionHearingTypes_ClientState'] = '{"selectedIndexes":["' + str(page) + '"],"logEntries":[],"scrollState":{}}'
        data['__ASYNCPOST'] = "true"
        data['RadAJAXControlID'] = "dnn_ctr1843_View_RadAjaxManager1"

        # Post the simulated tab click and scan the returned fragment
        # for document links:
        soup = BeautifulSoup(s.post(url, headers=headers, data=data).content, 'html.parser')
        for a in soup.select('a[href*="documents"]'):
            print('https://www.jud11.flcourts.org' + a['href'])
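
As a side note, the hand-assembled JSON strings in the payload above are easy to get wrong. json.dumps with compact separators produces identical values; a small sketch (the field names are taken from the code above, the tab_payload helper is made up):

```python
import json

def tab_payload(index):
    # Build the two tab-selection fields of the postback payload.
    # Compact separators make the output match the hand-written
    # strings byte for byte (no spaces after ':' or ',').
    compact = {'separators': (',', ':')}
    return {
        '__EVENTARGUMENT': json.dumps(
            {"type": 0, "index": str(index)}, **compact),
        'dnn_ctr1843_View_rtSectionHearingTypes_ClientState': json.dumps(
            {"selectedIndexes": [str(index)], "logEntries": [],
             "scrollState": {}}, **compact),
    }

print(tab_payload(2)['__EVENTARGUMENT'])  # {"type":0,"index":"2"}
```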