Scraping table links, clicking the links & grabbing data

Asked: 2017-01-12 23:29:25

Tags: python selenium beautifulsoup python-requests

I have a table from which I want to collect all the links, follow each link, and scrape the items inside td class="horse".

The home page that contains the table with all the links has the following code:

  <table border="0" cellspacing="0" cellpadding="0" class="full-calendar">
    <tr>
        <th width="160">&nbsp;</th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th>
    </tr>  


    <tr class="rows">
        <td>
            <p><span>FRIDAY 13 JAN</span></p>
        </td>

                <td>
                    <p>

                            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>

                            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>

                    </p>
                </td>

                <td>
                    <p>

                            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br>

                            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br>

                    </p>
                </td>

                <td>
                    <p>

                            <a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br>

                    </p>
                </td>
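
For reference, a minimal sketch of pulling just the /FreeFields/Form.aspx meeting links out of this markup. It assumes the snippet above is stored in a string named html, which is my own placeholder:

import re
from bs4 import BeautifulSoup

# html is a placeholder for the table markup shown above
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="full-calendar")

# the meeting links start with /FreeFields/Form.aspx, while the
# header-row links point at Calendar.aspx instead
for a in table.find_all("a", href=re.compile(r"^/FreeFields/Form\.aspx")):
    print(a["href"], a.get_text(strip=True))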

I currently have code that finds the table and prints the links:

from selenium import webdriver
import requests
from bs4 import BeautifulSoup

# path to chromedriver
path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'

# ensure browser is set to Chrome (opened here but not actually
# used below; the page itself is fetched with requests)
browser = webdriver.Chrome(executable_path=path_to_chromedriver)

# fetch the Racing Australia home page
url = 'http://www.racingaustralia.horse/'
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# find the calendar table & print the link for each page
table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
for link in table:
    print(link.get('href'))

Wondering if anyone can help me get the code to click through all the links in the table & perform the following on each page:

g_data = soup.find_all("td", {"class": "horse"})
for item in g_data:
    print(item.text)

Thanks in advance.

1 Answer:

Answer 0 (score: 0):

import requests, bs4, re
from urllib.parse import urljoin
start_url = 'http://www.racingaustralia.horse/'

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
    links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

def get_tds(link):
    soup = make_soup(link)
    tds = soup.find_all('td', class_="horse")
    if not tds:
        print(link, 'horse tag not found')
    else:
        for td in tds:
            print(td.text)

if __name__ == '__main__':
    links = get_links(start_url)
    for link in links:
        get_tds(link)
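
One thing to note: the href filter ^/FreeFields/ in get_links also matches the Calendar.aspx header links, which have no td class="horse" and produce the "horse tag not found" lines in the output below. A possible tightening (my own variation, not part of the original answer) is to match only the meeting form pages inside get_links:

a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/Form\.aspx"))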

Output:

http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx horse tag not found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW horse tag not found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC horse tag not found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD horse tag not found
http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA horse tag not found
.......

WEARETHECHAMPIONS 
STORMY HORIZON 
OUR RED JET 
SAPPER TOM 
MY COUSIN BOB 
ALL TOO HOT 
SAGA DEL MAR 
ZIGZOFF 
SASHAY AWAY 
SO SHE IS 
MILADY DUCHESS 

bs4 + requests can do what you need.
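
That said, if you specifically want to drive a browser as the question's Selenium setup suggests, the same loop can be done with Selenium alone. A minimal sketch, reusing the chromedriver path from the question; note it collects hrefs first rather than clicking, since navigating away would make the original link elements stale:

from selenium import webdriver

path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
browser.get('http://www.racingaustralia.horse/')

# collect absolute URLs up front; get_attribute('href') resolves
# the relative hrefs against the current page
links = [a.get_attribute('href')
         for a in browser.find_elements_by_css_selector('table.full-calendar a')]

for link in links:
    browser.get(link)
    for td in browser.find_elements_by_css_selector('td.horse'):
        print(td.text)

browser.quit()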