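I am trying to scrape an old website that has no standardized output and no style or id attributes on its tags. The data is simply displayed like this: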
<table BORDER="0" VALIGN="top" CELLPADDING="3" CELLSPACING="0" WIDTH="100%">
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Surname</strong>
</td>
<td valign="top">
Bloggs
</td>
</tr>
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Forename(s)</strong>
</td>
<td valign="top">
Joe
</td>
</tr>
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Title</strong>
</td>
<td valign="top">
Mr
</td>
</tr>
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Gender</strong>
</td>
<td valign="top">
Male
</td>
</tr>
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Occupation</strong>
</td>
<td valign="top">
</td>
</tr>
<tr>
<td ALIGN="left" VALIGN="top" WIDTH="175">
<strong>Date of Birth</strong>
</td>
<td valign="top">
13/05/12
</td>
</tr>
</table>
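The problem is that if a field is not present in the database, the site does not even display an empty row for it. It also sometimes inserts extra data as an additional table between the two core data tables, with no indication of when that happens.
My approach in Python is a bit long-winded, but the idea is to check the left TD as the header and grab the right TD, which holds the relevant data, as follows: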
title, forename, surname, gender, occupation, dob = '', '', '', '', '', ''
tbl1 = soup.findAll('table')[1]
for tr in tbl1.findAll('tr'):
    content = tr.findAll('td')
    if content[0].text.strip() == 'Title':
        title = content[1].text.strip()
    if content[0].text.strip() == 'Forename(s)':
        forename = content[1].text.strip()
    if content[0].text.strip() == 'Surname':
        surname = content[1].text.strip()
    if content[0].text.strip() == 'Gender':
        gender = content[1].text.strip()
    if content[0].text.strip() == 'Occupation':
        occupation = content[1].text.strip()
    if content[0].text.strip() == 'Date of Birth':
        dob = content[1].text.strip()
print('"' + title + '","' + forename + '","' + surname + '","' + gender + '","' + occupation + '","' + dob + '"')
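Whenever I try to iterate over all the tables, I get: AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?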
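The message points at the likely cause: `soup.findAll('table')` returns a ResultSet, which behaves like a list of tags, so `findAll` has to be called on each table in turn rather than on the ResultSet itself. A minimal sketch, assuming a small hypothetical two-table document in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, one row each.
html = """
<table><tr><td><strong>Surname</strong></td><td>Bloggs</td></tr></table>
<table><tr><td><strong>Title</strong></td><td>Mr</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")   # ResultSet: a list of <table> tags
# tables.find_all("tr")           # would raise the AttributeError quoted above

rows = []
for table in tables:              # call find_all on each table instead
    for tr in table.find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
```

Here `rows` ends up holding one list of cell texts per table row, collected across every table on the page.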
Answer 0 (score: 0)
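You can create a list of headers, in the same order as the rows appear in the table, and pair it with the values using itertools.izip_longest (Python 2; in Python 3 the function is itertools.zip_longest):
import itertools
import re
from bs4 import BeautifulSoup as soup
# Headers listed in the same order as the table rows
headers = ['surname', 'forename', 'title', 'gender', 'occupation', 'dob']
# web_data is assumed to hold the HTML of the table shown in the question
s = soup(web_data, 'lxml')
# Text of every <td>, with newlines and tabs removed
new_s = [re.sub('\n+|\t+', '', i.text) for i in s.findAll('td')]
# Odd-indexed cells hold the values; even-indexed cells hold the labels
final_data = {a: b for a, b in itertools.izip_longest(headers, new_s[1::2])}
Output:
{'surname': u'Bloggs', 'title': u'Mr', 'dob': u'13/05/12', 'gender': u'Male', 'forename': u'Joe', 'occupation': u''}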
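Pairing values with headers by position assumes every row is present and in a fixed order. Since the question notes that a missing field drops its row entirely, an alternative is to key each value by the text of its label cell. The sketch below is one possible approach, not part of the original answer; the `FIELDS` mapping, the `extract_fields` helper, and the sample HTML are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Illustrative mapping from the page's label text to output field names.
FIELDS = {
    "Title": "title",
    "Forename(s)": "forename",
    "Surname": "surname",
    "Gender": "gender",
    "Occupation": "occupation",
    "Date of Birth": "dob",
}

def extract_fields(html):
    """Return a dict of known fields, using '' for any label not on the page."""
    soup = BeautifulSoup(html, "html.parser")
    record = {name: "" for name in FIELDS.values()}
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) < 2:
                continue  # not a label/value row
            label = cells[0].get_text(strip=True)
            if label in FIELDS:
                record[FIELDS[label]] = cells[1].get_text(strip=True)
    return record

# Hypothetical sample: the Occupation row is missing and an unrelated
# table sits between the two data tables.
sample = """
<table><tr><td><strong>Surname</strong></td><td>Bloggs</td></tr>
<tr><td><strong>Title</strong></td><td>Mr</td></tr></table>
<table><tr><td>Unrelated</td><td>extra data</td></tr></table>
<table><tr><td><strong>Date of Birth</strong></td><td>13/05/12</td></tr></table>
"""
record = extract_fields(sample)
```

Because unknown labels are ignored and absent ones default to an empty string, the extra table and the missing row should not shift any value into the wrong field.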