I have a table and I want to grab all of its links, follow each link, and scrape the items in td class="horse".
The home page that holds the table of links contains the following markup:
<table border="0" cellspacing="0" cellpadding="0" class="full-calendar">
<tr>
<th width="160"> </th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th>
<th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th>
</tr>
<tr class="rows">
<td>
<p><span>FRIDAY 13 JAN</span></p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>
</p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br>
</p>
</td>
<td>
<p>
<a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br>
</p>
</td>
I currently have code that finds the table and prints the links:
from selenium import webdriver
import requests
from bs4 import BeautifulSoup

# path to chromedriver
path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'
# ensure browser is set to Chrome
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
# set browser to Racing Australia home page
url = 'http://www.racingaustralia.horse/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# find the calendar table & print the link for each page
table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
for link in table:
    print(link.get('href'))
I was wondering whether anyone could help me get the code to click through every link in the table and run the following on each page:
g_data = soup.find_all("td", {"class": "horse"})
for item in g_data:
    print(item.text)
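In other words, roughly the flow below for each link, though I am not sure this is the right way to handle the relative hrefs or whether I need Selenium at all (this is just a sketch of what I am aiming for, reusing the table variable from my code above):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.racingaustralia.horse/'

for link in table:
    # the hrefs are relative (/FreeFields/...), so build an absolute URL first
    page_url = urljoin(base_url, link.get('href'))
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    # then pull out every td class="horse" on that page
    for item in page_soup.find_all("td", {"class": "horse"}):
        print(item.text)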
Thanks in advance.
Answer 0 (score: 0)
import requests, bs4, re
from urllib.parse import urljoin

start_url = 'http://www.racingaustralia.horse/'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
    links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative urls to absolute urls
    return links

def get_tds(link):
    soup = make_soup(link)
    tds = soup.find_all('td', class_="horse")
    if not tds:
        print(link, "has no td class='horse'")
    else:
        for td in tds:
            print(td.text)

if __name__ == '__main__':
    links = get_links(start_url)
    for link in links:
        get_tds(link)
Output:
http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx has no td class='horse'
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW has no td class='horse'
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC has no td class='horse'
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD has no td class='horse'
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA has no td class='horse'
.......
WEARETHECHAMPIONS
STORMY HORIZON
OUR RED JET
SAPPER TOM
MY COUSIN BOB
ALL TOO HOT
SAGA DEL MAR
ZIGZOFF
SASHAY AWAY
SO SHE IS
MILADY DUCHESS
bs4 + requests will do what you need.
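If you would rather keep the Selenium/Chrome setup from the question (for example, if some pages only render via JavaScript), a rough equivalent is sketched below. It assumes the Selenium 3 style API and the chromedriver path used in the question, and it collects the hrefs up front so that navigating away does not invalidate the elements:

from selenium import webdriver
from bs4 import BeautifulSoup

path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.get('http://www.racingaustralia.horse/')

# collect absolute hrefs first, before any further navigation
links = [a.get_attribute('href')
         for a in browser.find_elements_by_css_selector('table.full-calendar a')]

for link in links:
    browser.get(link)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for td in soup.find_all('td', class_='horse'):
        print(td.text)

browser.quit()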