Unable to deal with some complicated layout content from a webpage

Asked: 2018-08-06 10:37:42

Tags: python python-3.x web-scraping beautifulsoup

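I've written a script in Python, in combination with BeautifulSoup, to parse some content from a webpage. The landing page contains two tables. I need to use the Results tabs in the first table, which lead to the target page.

From the target page, I'm only after information such as Grade: M 300 metres. There are multiple tabs above it, such as 1, 2, 3, 4 and so on, each with its own Grade information. I would like to grab all of them.

As the Results tabs on the landing page have no links attached to them, I had to use POST requests to get the content from the target page. A browser simulator is not an option for me in this case.

Most importantly, I need six POST requests to reach the content behind the six Results tabs.

The script I've pasted below only handles the content of the last Results tab. How can I fix the loop so that it fetches the content from all of the Results tabs?

This is my attempt: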

import requests
from bs4 import BeautifulSoup

url = "https://www.thedogs.com.au/Racing/Results.aspx?SearchDate=3-Jun-2018"

def get_info(session,link):
    session.headers['User-Agent'] = "Mozilla/5.0"
    res = session.get(link)
    soup = BeautifulSoup(res.text,"lxml")

    formdata = {}

    for items in soup.select("#aspnetForm input"):
        if "ctl00$ContentPlaceHolder1$rptrLatestRacingResults$ctl" in items.get("name"):continue
        if "ctl00$ContentPlaceHolder1$rptrSearchResults$ctl0" in items.get("name"):
            formdata[items.get("name")] = items.get("value")
        else:
            formdata[items.get("name")] = items.get("value")

    session.headers['User-Agent'] = "Mozilla/5.0"
    req = session.post(link,data = formdata)
    soup = BeautifulSoup(req.text,"lxml")
    for iteminfo in soup.select("[id^='ctl00_ContentPlaceHolder1_tabContainerRaces_tabRace'] span"):
        if "Grade:" in iteminfo.text:
            print(iteminfo.text)

if __name__ == '__main__':
    with requests.Session() as session:
        get_info(session,url)

Please see the two images below (one after the other) to identify the content I'm after:

[image]

[image]

1 answer:

Answer 0 (score: 3)

You can use the CSS selector span[id$=lblResultsRaceName] to find all spans whose ID ends with lblResultsRaceName, while 'td > span' finds all spans whose direct parent is a <td>.
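To see what the two selectors match without hitting the site, here is a minimal offline sketch. The HTML fragment and its IDs are made up for illustration (only the lblResultsRaceName suffix comes from the real page), and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the IDs mentioned above.
SAMPLE_HTML = """
<table>
  <tr>
    <td><span id="ctl00_rpt_ctl01_lblResultsRaceName"> Healsville </span></td>
    <td><div><span id="ctl00_rpt_ctl01_lblOther">nested</span></div></td>
  </tr>
  <tr>
    <td><span id="ctl00_rpt_ctl02_lblResultsRaceName"> Sale </span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [id$=...] matches elements whose id attribute ends with the given suffix.
race_names = [s.get_text(strip=True)
              for s in soup.select("span[id$=lblResultsRaceName]")]

# 'td > span' matches only spans that are direct children of a <td>,
# so the span nested inside the <div> is skipped.
direct_spans = [s.get_text(strip=True) for s in soup.select("td > span")]

print(race_names)
print(direct_spans)
```

Both lists contain only the two meeting names here, because the span wrapped in a <div> is neither a direct child of a <td> nor has an ID with the right suffix.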

The following snippet goes through all the Results buttons and prints every race:

import requests
from bs4 import BeautifulSoup

url = "https://www.thedogs.com.au/Racing/Results.aspx?SearchDate=3-Jun-2018"

def get_info(session, link):
    session.headers['User-Agent'] = "Mozilla/5.0"
    res = session.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Hidden inputs carry the ASP.NET form state that has to be posted back.
    hidden = {i['name']: i.get('value', '')
              for i in soup.select('input[type=hidden]') if i.get('name')}
    race_names = soup.select('span[id$=lblResultsRaceName]')
    buttons = soup.select('input[id$=btnViewResults]')
    for race_name, button in zip(race_names, buttons):
        print(race_name.text.strip())
        # One POST per Results button: hidden fields plus only the clicked button.
        formdata = {**hidden, button['name']: 'Results'}
        req = session.post(link, data=formdata)
        # Use a separate name so the landing-page soup is not overwritten.
        result_soup = BeautifulSoup(req.text, "lxml")
        for panel in result_soup.select('div[id^=ctl00_ContentPlaceHolder1_tabContainerRaces_tabRace]'):
            spans = panel.select('td > span')
            print(spans[0].text.strip(), spans[1].text.strip())
        print('#' * 80)

if __name__ == '__main__':
    with requests.Session() as session:
        get_info(session, url)

Prints:

Healsville
Race 1 Grade:  M   300 metres
Race 2 Grade:  M   350 metres
Race 3 Grade:  6/7   350 metres
Race 4 Grade:  R/W   300 metres
Race 5 Grade:  5   350 metres
Race 6 Grade:  SE   350 metres
Race 7 Grade:  4/5   350 metres
Race 8 Grade:  SE   350 metres
Race 9 Grade:  7   300 metres
Race 10 Grade:  6/7   300 metres
Race 11 Grade:  4/5   300 metres
Race 12 Grade:  5   300 metres
################################################################################
Sale
Race 1 Grade:  M   440 metres
Race 2 Grade:  M   440 metres
Race 3 Grade:  R/W   520 metres
Race 4 Grade:  7   440 metres
Race 5 Grade:  R/W   440 metres
Race 6 Grade:  4/5   520 metres
Race 7 Grade:  R/W   440 metres
Race 8 Grade:  4/5   440 metres
Race 9 Grade:  6/7   440 metres
Race 10 Grade:  R/W   440 metres
Race 11 Grade:  R/W   440 metres
Race 12 Grade:  5   520 metres
################################################################################
...and so on.
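The key to the loop is that every Results button gets its own POST, built from the hidden form fields plus the name of that one button. Below is a minimal sketch of just that step, run against a made-up form so it works offline. The field names, values and the build_payloads helper are hypothetical; the real page has more fields and longer control names.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified form; not the real page markup.
SAMPLE_FORM = """
<form id="aspnetForm">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <span id="r_ctl01_lblResultsRaceName">Healsville</span>
  <input type="submit" id="r_ctl01_btnViewResults"
         name="r$ctl01$btnViewResults" value="Results">
  <span id="r_ctl02_lblResultsRaceName">Sale</span>
  <input type="submit" id="r_ctl02_btnViewResults"
         name="r$ctl02$btnViewResults" value="Results">
</form>
"""

def build_payloads(html):
    """Return one (meeting name, form data) pair per Results button."""
    soup = BeautifulSoup(html, "html.parser")
    hidden = {
        field["name"]: field.get("value", "")
        for field in soup.select("input[type=hidden]")
        if field.get("name")
    }
    names = soup.select("span[id$=lblResultsRaceName]")
    buttons = soup.select("input[id$=btnViewResults]")
    payloads = []
    for name, button in zip(names, buttons):
        # Fresh dict per button, so earlier buttons never leak into later requests.
        data = {**hidden, button["name"]: button.get("value", "Results")}
        payloads.append((name.get_text(strip=True), data))
    return payloads

payloads = build_payloads(SAMPLE_FORM)
for meeting, data in payloads:
    print(meeting, sorted(data))
```

Each data dict would then be passed to session.post(link, data=data), as in the snippet above. Keeping the payload construction in its own function makes it possible to check what is being sent before making any request.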