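Unable to scrape a table to get href links

Time: 2021-03-03 04:00:02

Tags: python web-scraping beautifulsoup

I am trying to pull the href links out of a table; later I need to visit each link one by one to access the data behind it. But I can't figure out a way to do this. I tried find_all and got a "ResultSet object has no attribute '%s'" error.

HTML (it is really long, so this is about 1/10 of it):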

<thead>
<tr class="sctablehead">
<th>Academic Program</th>
<th>Departments</th>
<th>Academic Level</th>
<th>College</th>
<th>Online</th>
<th>Degree Type</th>
</tr>
</thead>
<tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td><td>Graduate</td><td>BUS</td><td></td><td>MAC</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/">Accounting</a></td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/">Accounting</a></td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/">Actuarial Science</a></td><td>Mathematics, Economics, Finance</td><td>Undergraduate</td><td>STEM</td><td></td><td>Minor</td></tr>
<tr class="even"><td><a href="/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/">Adult Gerontology Acute Care Nurse Practitioner</a></td><td>Nursing</td><td>Graduate</td><td>HHS</td><td></td><td>PMC</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/">Advertising and Public Relations</a></td><td>Advertising</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/">Advertising Public Relations</a></td><td>Marketing</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/">Aerospace Studies</a></td><td>Aerospace Studies</td><td>Undergraduate</td><td>HHS</td><td></td><td>Minor</td></tr>
<tr class="even"><td><a href="/undergraduate/colleges-programs/college-liberal-arts-social-sciences-education/department-africana-studies-minor/">Africana Studies</a></td><td>Africana Studies</td><td>Undergraduate</td><td>BCLASSE</td><td></td><td>Minor</td></tr>

...and so on

My code:

r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find_all('table', {"class":"sc_sctable tbl_degrees sorttable"})

for tr in programs_table.find_all('tr class'):
    for a in tr.find_all('a'):
        print(a['href'])

2 answers:

Answer 0 (score: 0)

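If your table is being found correctly (you did not provide the HTML for it), the main change is to search for 'tr' rather than 'tr class'. Note that find_all returns a ResultSet, which has no find_all method of its own, so the table is fetched with find here:

r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
# find returns a single Tag (or None); find_all would return a ResultSet
programs_table = soup.find('table', {"class":"sc_sctable tbl_degrees sorttable"})

for tr in programs_table.find_all('tr'):
    for a in tr.find_all('a'):
        print(a['href'])

In other words, use programs_table.find_all("tr") instead of programs_table.find_all("tr class"), and call it on the single table returned by find.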

With this, I get the following output:

/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/
/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/
/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/
/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/

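Answer 1 (score: 0)

First, you should not use find_all to grab a single tag unless you really want the result as a list. find_all returns a ResultSet, which is why calling find_all on it again raises the error you saw. To get the table, do this instead: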

programs_table = soup.find('table', class_="sc_sctable")
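To make the difference concrete, here is a minimal sketch comparing the two calls. The HTML string is an abbreviated stand-in for the real page, and it assumes the table carries the classes used in the question:

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the real page (assumption: the table has these classes).
html = """
<table class="sc_sctable tbl_degrees sorttable">
  <tbody>
    <tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table", class_="sc_sctable")  # list-like ResultSet of Tag objects
table = soup.find("table", class_="sc_sctable")       # a single Tag, or None if nothing matches

# Calling find_all on the ResultSet itself raises AttributeError.
try:
    tables.find_all("tr")
    result_set_supports_find_all = True
except AttributeError:
    result_set_supports_find_all = False

# Calling find_all on the single Tag works as expected.
row_count = len(table.find_all("tr"))
```

If you do need find_all for the table, index into the result (for example tables[0]) or loop over it before searching further.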

Now, to get the href of each inner <a> tag, grab the <td> tags and keep only those that contain an <a> tag:

tags_with_href = programs_table.tbody.find_all('td')
links = [each_tag.a['href'] for each_tag in tags_with_href if each_tag.a]
# -> ['/graduate/graduate-programs/master-accountancy/', ... ]
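As a self-contained check, the sketch below runs that list comprehension on two rows copied from the question (shortened to three columns, and wrapped in a table element that is assumed rather than shown in the question). It also shows a CSS-selector variant via select, which is an alternative and not part of the approach above:

```python
from bs4 import BeautifulSoup

# Two rows from the question, wrapped in an assumed <table> element.
html = """
<table class="sc_sctable tbl_degrees sorttable">
<tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td><td>Graduate</td></tr>
<tr class="odd"><td><a href="/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/">Aerospace Studies</a></td><td>Aerospace Studies</td><td>Undergraduate</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
programs_table = soup.find("table", class_="sc_sctable")

# Approach from the answer: keep only the <td> cells that contain an <a>.
links = [td.a["href"] for td in programs_table.tbody.find_all("td") if td.a]

# Alternative: a CSS selector that matches the anchors directly.
links_css = [a["href"] for a in programs_table.select("tbody td a[href]")]

print(links)
```

Both lists hold the same two relative paths; cells without a link are skipped in either version.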

If you want absolute URLs instead of relative ones, you can define base_url and prepend it to each relative URL:

base_url = '<base_url_of_website>'
links = [base_url + link for link in links]
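Plain concatenation produces a double slash if base_url ends with "/". One way around that, sketched here with the standard library's urljoin, is shown below; the host name is a hypothetical placeholder, not the real site:

```python
from urllib.parse import urljoin

# Hypothetical base URL; substitute the real site's address.
base_url = "https://catalog.example.edu"

links = ["/graduate/graduate-programs/master-accountancy/"]

# urljoin resolves each relative path against the base, with or without a trailing slash.
absolute_links = [urljoin(base_url, link) for link in links]
absolute_links_trailing = [urljoin(base_url + "/", link) for link in links]

print(absolute_links)
```

Each absolute URL can then be passed to requests.get to fetch the program page.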