Using Python 2.7 and BS4

Time: 2016-04-05 14:22:21

Tags: python-2.7 beautifulsoup

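I have a piece of software that exports competition results as an htm file. At the moment I open this file in Excel and reformat the results table into a layout suitable for uploading to a MySQL database, where PHP does a lot of number crunching.

I would like to automate the reformatting/parsing/scraping. Searching the web kept suggesting Python 2.7 with BeautifulSoup4, so I started playing with it, but I have not used either before.

Here is an idea of the original table structure: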

<h1>CLASS 1</h1>
    <table class="results">
        <tr><th nowrap="1">Name</th><th nowrap="1">Town</th><th nowrap="1">Bike</th><th nowrap="1">Penalty</th></tr>
        <tr><td class="rider">RIDER 1</td><td></td><td></td><td>01:14:20</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:26:36</td><td>00:19:51</td><td>00:27:54</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 2</td><td></td><td></td><td>00:41:06</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:19:10</td><td>00:21:57</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 3</td><td></td><td></td><td>00:36:59</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>00:37:00</td></tr></table></td></tr>
        <tr><td class="rider">RIDER 4</td><td></td><td></td><td>01:26:41</td></tr>
        <tr><td colspan="7"><table class="laps"><tr><td>01:26:42</td></tr></table></td></tr>
    </table>
<h1>CLASS 2</h1>

I would like to export one table per class, with all of each rider's information on a single line, like this:

NAME1 02:26:42 00:12:42 00:13:04 00:13:25 00:13:19 00:13:22 00:13:29 00:13:44
NAME2 02:41:06 00:13:17 00:14:10 00:13:40 00:13:38 00:13:47 00:13:12 00:13:24

Playing around in Python, so far I have used BeautifulSoup to read the file.

from bs4 import BeautifulSoup

with open(r'test.htm', "r") as f:
    pagebuffer = f.read()  
soup = BeautifulSoup(pagebuffer, "lxml")

After inspecting the HTML, I was able to search the soup for the relevant class names.

riders = soup.find_all(class_="rider")
for item in riders:
    print item.text   

NAME1
NAME2
NAME3
NAME4
NAME5
NAME6
NAME7
NAME8
NAME9
NAME10
NAME11
NAME12
NAME13
NAME14
NAME15
NAME16
NAME17
NAME18
NAME19

laps = soup.find_all(class_="laps")
for item in laps:
    print item.text  

00:12:4200:13:0400:13:2500:13:1900:13:2200:13:2900:13:4400:13:3000:13:2000:13:3800:13:10
00:12:2600:13:1700:14:1000:13:4000:13:3800:13:4700:13:1200:13:2400:13:2500:13:4700:13:43
00:12:3100:13:1300:13:2200:13:5200:13:5500:14:0800:13:2500:13:4500:13:5300:13:4400:13:25
00:14:2300:14:2600:15:0100:14:5300:14:5800:14:3100:14:4400:15:3300:14:1900:14:14
00:13:5700:13:4800:14:1900:14:3200:14:5100:15:0300:14:3600:17:5700:14:4200:14:39
00:14:1100:14:3200:14:4300:14:2300:14:5900:14:4600:15:1000:15:0500:15:1400:16:13
00:13:4100:13:3200:14:0000:14:0100:14:3200:14:1000:14:3600:14:2100:28:5500:14:17
00:13:3000:13:3900:14:00


02:36:4900:13:2800:13:3700:13:5600:13:5700:14:4600:14:1700:14:2700:15:1800:14:3800:14:1100:14:15
02:36:5800:13:5900:13:4900:14:1900:14:1100:14:2300:14:2700:14:2700:14:2400:14:2600:14:1300:14:21
02:27:0100:14:2300:14:2600:15:0100:14:5300:14:5800:14:3100:14:4400:15:3300:14:1900:14:14
02:28:2300:13:5700:13:4800:14:1900:14:3200:14:5100:15:0300:14:3600:17:5700:14:4200:14:39
02:29:1500:14:1100:14:3200:14:4300:14:2300:14:5900:14:4600:15:1000:15:0500:15:1400:16:13
02:36:0400:13:4100:13:3200:14:0000:14:0100:14:3200:14:1000:14:3600:14:2100:28:5500:14:17
00:41:0800:13:3000:13:3900:14:00


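This is where I am stuck:
1. How do I combine these two searches (riders, laps)?
2. The total time is not identified by a class name. How do I search for the 3rd [td] tag after the [td] tag with class rider?
3. Can I turn this into an executable that runs on machines without Python and bs4 installed, or should I look at a different coding approach?

Here is a link to a typical htm file: http://www.kr3w.co.uk/downloads/test.htm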

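1 answer:

Answer 0: (score: 0)

For each rider, you need to get the next row (the tr that holds the nested laps table) using find_next_sibling().

Complete implementation: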

from pprint import pprint

from bs4 import BeautifulSoup

data = """
<table class="results">
    <tr><th nowrap="1">Name</th><th nowrap="1">Town</th><th nowrap="1">Bike</th><th nowrap="1">Penalty</th></tr>
    <tr><td class="rider">RIDER 1</td><td></td><td></td><td>01:14:20</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:26:36</td><td>00:19:51</td><td>00:27:54</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 2</td><td></td><td></td><td>00:41:06</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:19:10</td><td>00:21:57</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 3</td><td></td><td></td><td>00:36:59</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>00:37:00</td></tr></table></td></tr>
    <tr><td class="rider">RIDER 4</td><td></td><td></td><td>01:26:41</td></tr>
    <tr><td colspan="7"><table class="laps"><tr><td>01:26:42</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")
data = []
for rider in soup.select("td.rider"):  # all td elements having `rider` class
    rider_name = rider.get_text()
    # getting the last td element in this row - total time
    total_time = rider.parent.find_all("td")[-1].get_text()

    # getting laps from the next row from the current
    laps = [td.get_text() for td in rider.parent.find_next_sibling('tr').select("table.laps tr td")]

    data.append([rider_name, total_time] + laps)

pprint(data)

Prints:

[['RIDER 1', '01:14:20', '00:26:36', '00:19:51', '00:27:54'],
 ['RIDER 2', '00:41:06', '00:19:10', '00:21:57'],
 ['RIDER 3', '00:36:59', '00:37:00'],
 ['RIDER 4', '01:26:41', '01:26:42']]
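
On the second question (reaching the total time from the rider cell): taking the last td of the row works for the sample above. If you would rather count cells starting from the rider cell, find_next_siblings("td") returns the td elements that follow it in the same row, so index 2 is the third one. A minimal sketch, assuming the four-column layout shown in the question; the `html` string and the `totals` list are illustrative names, and the index may need adjusting for the real file if its rows have more columns:

```python
from bs4 import BeautifulSoup

# Two rider rows copied from the sample in the question (laps rows omitted).
html = """
<table class="results">
    <tr><td class="rider">RIDER 1</td><td></td><td></td><td>01:14:20</td></tr>
    <tr><td class="rider">RIDER 2</td><td></td><td></td><td>00:41:06</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

totals = []
for rider in soup.select("td.rider"):
    # All td elements that come after the rider cell within the same row.
    following = rider.find_next_siblings("td")
    # Assumption: the total time is the third cell after the rider cell.
    totals.append((rider.get_text(), following[2].get_text()))

print(totals)
```

Indexing from the end of the row, as in the implementation above, is less sensitive to extra columns between the name and the total, which is one reason to prefer it.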

Also, since your real HTML contains multiple tables, you need to process it table by table. Working example:

from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = "http://www.kr3w.co.uk/downloads/test.htm"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"}).content
soup = BeautifulSoup(response, "html.parser")


for table in soup.select("table.results"):
    title = table.find_previous_sibling("h1").get_text()

    data = []
    for rider in table.select("td.rider"):  # all td elements having `rider` class
        rider_name = rider.get_text()
        # getting the last td element in this row - total time
        total_time = rider.parent.find_all("td")[-1].get_text()

        # getting laps from the next row from the current
        laps = [td.get_text() for td in rider.parent.find_next_sibling('tr').select("table.laps tr td")]

        data.append([rider_name, total_time] + laps)

    print(title)
    pprint(data)
    print("-----")