So I'm scraping a website, and basically I want to store some table data in a dictionary.
Here is the program I'm scraping with:
from bs4 import BeautifulSoup
from collections import defaultdict
import json
import requests
import re
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for row in my_table.find_all('tr'):
        try:
            name, div_rank, gender_rank, overall_rank, swim, bike, run, total_time = (
                col.text.strip() for col in row.find_all('td')[2:])
        except ValueError:
            continue
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
            'total_time': total_time
        })
    return result

print(json.dumps(parse_table(soup), indent=3))
I checked, and print(my_table) is not empty, but print(my_table.find_all('tr')) comes back empty. All the data I need sits in td tags inside tr tags.
Why is find_all('tr') on my_table returning empty?
Edit: the output of print(my_table) is basically a bunch of tr rows wrapped inside HTML comments, along the lines of <!-- <tr>...</tr> -->.
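A quick way to see this for yourself (a diagnostic sketch, not part of the original question) is to print each child of the tbody; real rows would show up as Tag nodes, but here they show up as Comment nodes:

from bs4 import Comment

my_table = soup.find('tbody')
for child in my_table.children:
    # Show what kind of node each child is, plus a preview of its content.
    print(type(child).__name__, str(child)[:60])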
Answer 0 (score: 2)
The problem is that the HTML actually returned inside the table contains comments rather than rows (perhaps to frustrate scrapers?). On top of that, there are several Python bugs. If we grab the tbody and then extract the comment from it (which holds the real data), we can parse that comment as an HTML table.
The columns are not in the order in which they appear when you view the HTML in a browser; I imagine they get rearranged once the comment is converted. In any case, we then access the data as it is organized in the source, which differs from what the browser displays. The total time does not seem to be included in the table; I imagine the JavaScript that turns the comment into a table computes it, so you may have to calculate it yourself (not done here; see the sketch below).
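If you do want a rough total, a minimal sketch (my addition, not part of the original answer) could sum the three splits; note it ignores the T1/T2 transition times, so it will undercount the real finish time:

from datetime import timedelta

def to_timedelta(hms):
    # Parse an "HH:MM:SS" split into a timedelta.
    h, m, s = map(int, hms.split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

def approx_total(swim, bike, run):
    # Hypothetical helper: sum of the three splits only; the real total
    # also includes the two transitions, which the table omits.
    return str(sum(map(to_timedelta, (swim, bike, run)), timedelta()))

print(approx_total('00:57:56', '05:00:29', '04:20:04'))  # -> 10:18:29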
Code
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->".
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        # Note: the column order in the source differs from the rendered page.
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [
            col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
            # 'total_time': total_time
        })
    return result

print(json.dumps(parse_table(soup), indent=3))
This gets you all sorts of entries (I'm only showing a couple):
{
   "Goodlad, Martin 977": [
      {
         "div_rank": "156",
         "gender_rank": "899",
         "overall_rank": "1026",
         "swim": "00:57:56",
         "bike": "05:00:29",
         "run": "04:20:04"
      }
   ],
   "Maley, Joel 1840": [
      {
         "div_rank": "39",
         "gender_rank": "171",
         "overall_rank": "186",
         "swim": "01:12:01",
         "bike": "04:34:59",
         "run": "03:17:13"
      }
   ]
}
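As an aside (a sketch of my own, not part of the answer): if the commented-out rows were ever not direct children of the tbody, BeautifulSoup can also collect every comment anywhere in the document with a string filter:

from bs4 import BeautifulSoup, Comment

def find_comments(soup):
    # Return every HTML comment node anywhere in the parsed document.
    return soup.find_all(string=lambda text: isinstance(text, Comment))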
Answer 1 (score: 0)
The problem is with html.parser.
Could you install lxml and change the code to the following?
soup = BeautifulSoup(data, 'lxml')
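Note that lxml is a third-party package; if it is missing, BeautifulSoup raises bs4.FeatureNotFound when asked for the 'lxml' parser, so install it first (pip install lxml). As a sketch, not part of the original answer, you could also fall back to the stdlib parser:

try:
    soup = BeautifulSoup(data, 'lxml')
except Exception:  # bs4.FeatureNotFound if lxml is not installed
    soup = BeautifulSoup(data, 'html.parser')  # fall back to the stdlib parser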
Output:
defaultdict(<class 'list'>,
{'1': [{'div_rank': '1', 'gender_rank': '1', 'overall_rank': '00:50:37', 'swim': '04:16:04', 'bike': '02:41:31', 'run': '07:52:39', 'total_time': '5000'}],
'2': [{'div_rank': '2', 'gender_rank': '2', 'overall_rank': '00:54:07', 'swim': '04:12:25', 'bike': '02:45:41', 'run': '07:56:41', 'total_time': '4951'}],
'3': [{'div_rank': '3', 'gender_rank': '3', 'overall_rank': '00:49:31', 'swim': '04:21:18', 'bike': '02:46:03', 'run': '08:01:09', 'total_time': '4898'}],
'4': [{'div_rank': '4', 'gender_rank': '4', 'overall_rank': '00:47:45', 'swim': '04:18:45', 'bike': '02:52:33', 'run': '08:03:17', 'total_time': '4872'}],
'5': [{'div_rank': '5', 'gender_rank': '5', 'overall_rank': '00:49:28', 'swim': '04:17:17', 'bike': '02:53:38', 'run': '08:04:41', 'total_time': '4855'}],
'6': [{'div_rank': '6', 'gender_rank': '6', 'overall_rank': '00:54:02', 'swim': '04:12:58', 'bike': '02:52:56', 'run': '08:04:45', 'total_time': '4854'}],
'7': [{'div_rank': '7', 'gender_rank': '7', 'overall_rank': '00:50:53', 'swim': '04:15:41', 'bike': '02:54:15', 'run': '08:05:54', 'total_time': '4841'}],
'8': [{'div_rank': '8', 'gender_rank': '8', 'overall_rank': '00:49:33', 'swim': '04:18:51', 'bike': '02:56:27', 'run': '08:09:34', 'total_time': '4797'}],
'9': [{'div_rank': '9', 'gender_rank': '9', 'overall_rank': '00:50:51', 'swim': '04:09:06', 'bike': '03:06:18', 'run': '08:10:32', 'total_time': '4785'}],
'10': [{'div_rank': '10', 'gender_rank': '10', 'overall_rank': '00:54:14', 'swim': '04:11:27', 'bike': '03:00:02', 'run': '08:11:04', 'total_time': '4779'}],
'11': [{'div_rank': '11', 'gender_rank': '11', 'overall_rank': '00:47:46', 'swim': '04:19:44', 'bike': '02:59:24', 'run': '08:11:41', 'total_time': '4771'}],
'12': [{'div_rank': '12', 'gender_rank': '12', 'overall_rank': '00:50:39', 'swim': '04:27:47', 'bike': '02:50:36', 'run': '08:13:47', 'total_time': '4746'}],
'13': [{'div_rank': '13', 'gender_rank': '13', 'overall_rank': '00:50:56', 'swim': '04:15:17', 'bike': '03:02:50', 'run': '08:14:02', 'total_time': '4743'}],
'14': [{'div_rank': '14', 'gender_rank': '14', 'overall_rank': '00:50:48', 'swim': '04:19:48', 'bike': '02:58:04', 'run': '08:14:31', 'total_time': '4737'}],
'15': [{'div_rank': '15', 'gender_rank': '15', 'overall_rank': '00:50:39', 'swim': '04:19:58', 'bike': '03:00:17', 'run': '08:15:58', 'total_time': '4720'}],
'16': [{'div_rank': '16', 'gender_rank': '16', 'overall_rank': '00:50:45', 'swim': '04:25:04', 'bike': '02:57:35', 'run': '08:17:54', 'total_time': '4697'}],
'17': [{'div_rank': '17', 'gender_rank': '17', 'overall_rank': '00:50:41', 'swim': '04:21:02', 'bike': '03:02:00', 'run': '08:18:18', 'total_time': '4692'}],
'18': [{'div_rank': '18', 'gender_rank': '18', 'overall_rank': '00:50:45', 'swim': '04:19:56', 'bike': '03:03:47', 'run': '08:19:13', 'total_time': '4681'}],
'19': [{'div_rank': '19', 'gender_rank': '19', 'overall_rank': '00:47:43', 'swim': '04:19:01', 'bike': '03:08:42', 'run': '08:19:40', 'total_time': '4675'}],
'20': [{'div_rank': '20', 'gender_rank': '20', 'overall_rank': '00:47:51', 'swim': '04:18:38', 'bike': '03:10:07', 'run': '08:21:52', 'total_time': '4649'}]})