I want to scrape some data from a website using BeautifulSoup. The data sits in a table whose rows carry four different classes in total:
<table class="timeteam">
<tbody>
<tr class="even"></tr>
<tr class="even smallrow"></tr>
<tr class="odd"></tr>
<tr class="odd smallrow"></tr>
</tbody>
</table>
The data in the 'even' and 'odd' rows belongs together, so I want to end up with both rows (and the other pairs) combined in a DataFrame.
With find_all('tr', class_=['even', 'odd']) I also get the other rows (the ones with smallrow). I therefore tried re.compile, but the result is still the same.
What do I need to change in my code to select only the rows whose class is exactly "even" or "odd"?
Here is my code:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup as bs

page = requests.get('https://regatta.time-team.nl/hollandia/2017/results/003.php')
soup = bs(page.content, 'html.parser')
tables = soup.find_all('table', class_='timeteam')

player_data_even = []
player_data_smallrow = []
for i in range(len(tables)):
    for tr in tables[i].find_all('tr', class_=re.compile(r"^(even|odd)$")):
        player_row_even = []
        for td in tr.find_all('td'):
            player_row_even.append(td.get_text())
        player_data_even.append(player_row_even)
    for tr in tables[i].find_all('tr', class_=['even smallrow', 'odd smallrow']):
        player_row_smallrow = []
        for td in tr.find_all('td'):
            player_row_smallrow.append(td.get_text())
        player_data_smallrow.append(player_row_smallrow)

players_even = pd.DataFrame(player_data_even)
players_smallrow = pd.DataFrame(player_data_smallrow)
Answer 0 (score: 0)
You can check whether each row's list of classes has length 1:
from bs4 import BeautifulSoup as soup
content = """
<table class="timeteam">
<tbody>
<tr class="even">yes</tr>
<tr class="even smallrow">no</tr>
<tr class="odd">yes</tr>
<tr class="odd smallrow">no</tr>
</tbody>
</table>
"""
s = [i.text for i in soup(content, 'html.parser').find_all('tr') if len(i['class']) == 1]
Output:
['yes', 'yes']
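An equivalent way to express the same length-1 check is to pass a filter function to find_all instead of filtering afterwards. This is a minimal sketch over the same sample markup; note that a function passed to the tag filter receives the whole tag, so it can inspect the full class list (whereas a string or regex passed as class_ is matched against each class individually, which is why the re.compile attempt in the question still matched the smallrow rows):

```python
from bs4 import BeautifulSoup

content = """
<table class="timeteam">
<tbody>
<tr class="even">yes</tr>
<tr class="even smallrow">no</tr>
<tr class="odd">yes</tr>
<tr class="odd smallrow">no</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(content, 'html.parser')

# Match <tr> tags whose class list is exactly ['even'] or ['odd'],
# i.e. rows without the extra 'smallrow' class.
rows = soup.find_all(lambda tag: tag.name == 'tr'
                     and tag.get('class') in (['even'], ['odd']))
print([r.text for r in rows])  # ['yes', 'yes']
```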
Answer 1 (score: 0)
OK, it seems your problem comes down to filtering out the smallrow rows. With the following you can get all the data in order. Give it a try:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://regatta.time-team.nl/hollandia/2017/results/003.php")
soup = BeautifulSoup(res.text, "lxml")
for table in soup.find(class_="timeteam").find_all("tr", class_=['even', 'odd']):
    if "smallrow" not in table.get('class'):  # this is the fix
        data = [item.get_text(strip=True) for item in table]
        print(data)
Output you might get:
['1.', 'PHO', 'Phocas 1 (p2)', '', '--', '', '--', '', '--', '', '06:39,86', '(1)', 'KF']
['2.', 'PAM', 'Pampus (p4)', '', '--', '', '--', '', '--', '', '06:45,21', '(2)', 'KF']
['3.', 'SKO', 'Skøll 1', '', '--', '', '--', '', '--', '', '06:46,23', '(3)', 'KF']
['4.', 'NJO', 'Njord (p1)', '', '--', '', '--', '', '--', '', '06:49,44', '(4)', 'KF']
['5.', 'GYA', 'Gyas (SB)', '', '--', '', '--', '', '--', '', '06:50,04', '(5)', 'KF']
['6.', 'PRO', 'Proteus 1 (p7)', '', '--', '', '--', '', '--', '', '06:50,24', '(6)', 'KF']
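The question also asked to combine each 'even'/'odd' row with the data that belongs to it. A sketch of one way to do that, under the assumption that each main row is immediately followed by its smallrow companion, using a tiny hypothetical table (the real timeteam table has many more columns):

```python
import pandas as pd
from bs4 import BeautifulSoup

content = """
<table class="timeteam">
<tbody>
<tr class="even"><td>1.</td><td>PHO</td></tr>
<tr class="even smallrow"><td>crew A</td></tr>
<tr class="odd"><td>2.</td><td>PAM</td></tr>
<tr class="odd smallrow"><td>crew B</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(content, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    if 'smallrow' in tr.get('class', []):
        continue  # handled together with the preceding main row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # Append the cells of the following 'smallrow' row, if there is one.
    nxt = tr.find_next_sibling('tr')
    if nxt and 'smallrow' in nxt.get('class', []):
        cells += [td.get_text(strip=True) for td in nxt.find_all('td')]
    rows.append(cells)

df = pd.DataFrame(rows)
print(df)
```

This yields one DataFrame row per even/odd pair, e.g. ['1.', 'PHO', 'crew A'].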